Shinkansen Travel Experience Hackathon¶

  • July 19 at 6:00 AM - July 22 at 6:00 AM
  • Allowed team size: 1-3
  • My ranking: 11/35

Problem Statement:¶

The objective of the Shinkansen Travel Experience hackathon is to predict whether a passenger was satisfied with their overall experience of travelling on the Shinkansen bullet train. The Japanese high-speed passenger rail system is known for being rapid, reliable and consistent. Using machine-learning techniques, participants were required to ascertain how significantly each parameter contributes to the overall travel experience of passengers. The datasets consist of: a) the on-time performance of the trains along with passenger information, published in a file named ‘Traveldata_train.csv’, and b) surveys collected from a random sample of travellers from the same population as the travel data, recording the travellers' post-travel experiences, in the file named ‘Surveydata_train.csv’. The survey data contains feedback on the parameters of the travel experience, including the target variable, overall experience. The files are separated into train and test files.

Data Dictionary:¶

Travel Data:¶

  • ID: The unique ID of the passenger
  • Gender: The gender of the passenger
  • Customer_Type: Loyalty type of the passenger
  • Age: The age of the passenger
  • Type_Travel: Purpose of travel of the passenger
  • Travel_Class: The train class that the passenger traveled in
  • Travel_Distance: The distance traveled by the passenger
  • Departure_Delay_In_Mins: The delay (in minutes) in train departure
  • Arrival_Delay_In_Mins: The delay (in minutes) in train arrival

Survey Data:¶

  • ID: The unique ID of the passenger
  • Platform_Location: How convenient the location of the platform is for the passenger
  • Seat_Class: The type of the seat class on the train
  • Overall_Experience: The overall experience of the passenger
  • Seat_Comfort: The comfort level of the seat for the passenger
  • Arrival_time_Convenient: How convenient the arrival time of the train is for the passenger
  • Catering: How convenient the catering service is for the passenger
  • Onboard_Wi-Fi_Service: The quality of the onboard Wi-Fi service for the passenger
  • Onboard_Entertainment: The quality of the onboard entertainment for the passenger
  • Online_Support: The quality of the online support for the passenger
  • Ease_of_Online_Booking: The level of ease of booking a trip online
  • Onboard_Service: The quality of service onboard for the passenger
  • Legroom: The convenience of the legroom provided for the passenger
  • Baggage_Handling: The convenience of the handling of baggage for the customer
  • CheckIn_Service: The convenience of the check-in service for the passenger
  • Cleanliness: The passenger's view of the cleanliness of the service
  • Online_Boarding: The convenience of the online boarding process for the passenger

Evaluation Criteria:¶

The evaluation metric is the accuracy score of the model, i.e., the percentage of predictions made by the model that turn out to be correct. The score is calculated as the total number of correct predictions (true positives plus true negatives) divided by the total number of observations. The highest possible accuracy is 100% (or 1) whilst the worst possible accuracy is 0%. For this real-world machine-learning classification problem, the benchmark accuracy score is approximately 95.00%. My goal in this competition was therefore to achieve a higher score than the benchmark.
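The score described above reduces to a one-line ratio; a minimal sketch (an illustration only, not the competition's scoring script):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy example: 4 of 5 predictions are correct
print(accuracy([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]))  # 0.8
```

This is equivalent to Scikit-Learn's `accuracy_score`, imported later in the notebook.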

Approach to arrive at the insights and recommendations:¶

  1. Importing the necessary libraries

  2. Reading in the dataset to get an overview

  3. Conducting exploratory data analysis - a. Univariate, b. Bi & Multi-variate, c. Answering questions about particular variables of interest

  4. Preparing the data

  5. Defining the performance metric

  6. Building the Machine Learning models, checking the performance and feature importances, tuning the models where necessary and running the predictions

  7. Recording the observations

  8. Comparing the model performances

  9. Choosing the best model for deployment

  10. Summarising the key observations, business insights and recommendations

1. Importing the necessary libraries¶

In [8]:
# Importing the library packages
import pandas as pd  # library used for data manipulation and analysis
import numpy as np  # library used for working with arrays
import matplotlib.pyplot as plt  # library for plots and visualizations
import seaborn as sns  # library for visualizations
sns.set()  # applying the default seaborn theme

# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')

# Importing machine learning models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

# Importing additional functions from Scikit-Learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

# Importing functions to generate different metric scores
from sklearn.metrics import (confusion_matrix, classification_report, roc_auc_score,
                             precision_recall_curve, roc_curve, make_scorer,
                             recall_score, accuracy_score)

2. Reading in the dataset to get an overview¶

In [10]:
# Loading the data set 
df_survey = pd.read_csv('Surveydata_train.csv')
df_survey_test = pd.read_csv('Surveydata_test.csv')
df_travel = pd.read_csv('Traveldata_train.csv')
df_travel_test = pd.read_csv('Traveldata_test.csv')
In [11]:
df_survey.head()
Out[11]:
ID Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
0 98800001 0 Needs Improvement Green Car Excellent Excellent Very Convenient Good Needs Improvement Acceptable Needs Improvement Needs Improvement Acceptable Needs Improvement Good Needs Improvement Poor
1 98800002 0 Poor Ordinary Excellent Poor Needs Improvement Good Poor Good Good Excellent Needs Improvement Poor Needs Improvement Good Good
2 98800003 1 Needs Improvement Green Car Needs Improvement Needs Improvement Needs Improvement Needs Improvement Good Excellent Excellent Excellent Excellent Excellent Good Excellent Excellent
3 98800004 0 Acceptable Ordinary Needs Improvement NaN Needs Improvement Acceptable Needs Improvement Acceptable Acceptable Acceptable Acceptable Acceptable Good Acceptable Acceptable
4 98800005 1 Acceptable Ordinary Acceptable Acceptable Manageable Needs Improvement Good Excellent Good Good Good Good Good Good Good
In [12]:
df_survey.shape
Out[12]:
(94379, 17)
In [13]:
df_survey_test.head()
Out[13]:
ID Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking Onboard_Service Legroom Baggage_Handling CheckIn_Service Cleanliness Online_Boarding
0 99900001 Acceptable Green Car Acceptable Acceptable Manageable Needs Improvement Excellent Good Excellent Excellent Excellent Excellent Good Excellent Poor
1 99900002 Extremely Poor Ordinary Good Poor Manageable Acceptable Poor Acceptable Acceptable Excellent Acceptable Good Acceptable Excellent Acceptable
2 99900003 Excellent Ordinary Excellent Excellent Very Convenient Excellent Excellent Excellent Needs Improvement Needs Improvement Needs Improvement Needs Improvement Good Needs Improvement Excellent
3 99900004 Acceptable Green Car Excellent Acceptable Very Convenient Poor Acceptable Excellent Poor Acceptable Needs Improvement Excellent Excellent Excellent Poor
4 99900005 Excellent Ordinary Extremely Poor Excellent Needs Improvement Excellent Excellent Excellent Excellent NaN Acceptable Excellent Excellent Excellent Excellent
In [14]:
df_survey_test.shape
Out[14]:
(35602, 16)
In [15]:
df_travel.head()
Out[15]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins
0 98800001 Female Loyal Customer 52.0 NaN Business 272 0.0 5.0
1 98800002 Male Loyal Customer 48.0 Personal Travel Eco 2200 9.0 0.0
2 98800003 Female Loyal Customer 43.0 Business Travel Business 1061 77.0 119.0
3 98800004 Female Loyal Customer 44.0 Business Travel Business 780 13.0 18.0
4 98800005 Female Loyal Customer 50.0 Business Travel Business 1981 0.0 0.0
In [16]:
df_travel.shape
Out[16]:
(94379, 9)
In [17]:
df_travel_test.head()
Out[17]:
ID Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins
0 99900001 Female NaN 36.0 Business Travel Business 532 0.0 0.0
1 99900002 Female Disloyal Customer 21.0 Business Travel Business 1425 9.0 28.0
2 99900003 Male Loyal Customer 60.0 Business Travel Business 2832 0.0 0.0
3 99900004 Female Loyal Customer 29.0 Personal Travel Eco 1352 0.0 0.0
4 99900005 Male Disloyal Customer 18.0 Business Travel Business 1610 17.0 0.0
In [18]:
df_travel_test.shape
Out[18]:
(35602, 9)

I will merge the 'survey' and 'travel data' datasets and then investigate which columns are relevant to our task.

In [20]:
# Creating the train dataset by merging the survey data with the travel data
df_train = pd.merge(df_survey, df_travel.drop_duplicates(['ID']), on="ID", how="left")
df_train.head()
Out[20]:
ID Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support ... Cleanliness Online_Boarding Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins
0 98800001 0 Needs Improvement Green Car Excellent Excellent Very Convenient Good Needs Improvement Acceptable ... Needs Improvement Poor Female Loyal Customer 52.0 NaN Business 272 0.0 5.0
1 98800002 0 Poor Ordinary Excellent Poor Needs Improvement Good Poor Good ... Good Good Male Loyal Customer 48.0 Personal Travel Eco 2200 9.0 0.0
2 98800003 1 Needs Improvement Green Car Needs Improvement Needs Improvement Needs Improvement Needs Improvement Good Excellent ... Excellent Excellent Female Loyal Customer 43.0 Business Travel Business 1061 77.0 119.0
3 98800004 0 Acceptable Ordinary Needs Improvement NaN Needs Improvement Acceptable Needs Improvement Acceptable ... Acceptable Acceptable Female Loyal Customer 44.0 Business Travel Business 780 13.0 18.0
4 98800005 1 Acceptable Ordinary Acceptable Acceptable Manageable Needs Improvement Good Excellent ... Good Good Female Loyal Customer 50.0 Business Travel Business 1981 0.0 0.0

5 rows × 25 columns


In [22]:
df_train.shape
Out[22]:
(94379, 25)
In [23]:
# Checking the data types of the columns in the dataset
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94379 entries, 0 to 94378
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       94379 non-null  int64  
 1   Overall_Experience       94379 non-null  int64  
 2   Seat_Comfort             94318 non-null  object 
 3   Seat_Class               94379 non-null  object 
 4   Arrival_Time_Convenient  85449 non-null  object 
 5   Catering                 85638 non-null  object 
 6   Platform_Location        94349 non-null  object 
 7   Onboard_Wifi_Service     94349 non-null  object 
 8   Onboard_Entertainment    94361 non-null  object 
 9   Online_Support           94288 non-null  object 
 10  Ease_of_Online_Booking   94306 non-null  object 
 11  Onboard_Service          86778 non-null  object 
 12  Legroom                  94289 non-null  object 
 13  Baggage_Handling         94237 non-null  object 
 14  CheckIn_Service          94302 non-null  object 
 15  Cleanliness              94373 non-null  object 
 16  Online_Boarding          94373 non-null  object 
 17  Gender                   94302 non-null  object 
 18  Customer_Type            85428 non-null  object 
 19  Age                      94346 non-null  float64
 20  Type_Travel              85153 non-null  object 
 21  Travel_Class             94379 non-null  object 
 22  Travel_Distance          94379 non-null  int64  
 23  Departure_Delay_in_Mins  94322 non-null  float64
 24  Arrival_Delay_in_Mins    94022 non-null  float64
dtypes: float64(3), int64(3), object(19)
memory usage: 18.0+ MB

Observations:

The column data types indicate that most of the travel experience and survey parameters are of object type. As we saw in the data overview, the categorical nature of these observations will require further pre-processing to convert them into a form our ML models can use. We will look at the numerical and categorical variables in more detail in the EDA section.

In [25]:
# Checking the missing values in each column
df_train.isna().sum()
Out[25]:
ID                            0
Overall_Experience            0
Seat_Comfort                 61
Seat_Class                    0
Arrival_Time_Convenient    8930
Catering                   8741
Platform_Location            30
Onboard_Wifi_Service         30
Onboard_Entertainment        18
Online_Support               91
Ease_of_Online_Booking       73
Onboard_Service            7601
Legroom                      90
Baggage_Handling            142
CheckIn_Service              77
Cleanliness                   6
Online_Boarding               6
Gender                       77
Customer_Type              8951
Age                          33
Type_Travel                9226
Travel_Class                  0
Travel_Distance               0
Departure_Delay_in_Mins      57
Arrival_Delay_in_Mins       357
dtype: int64
In [26]:
# Checking the missing values in the data percentage-wise
round(df_train.isnull().sum() / df_train.isnull().count() * 100, 2)
Out[26]:
ID                         0.00
Overall_Experience         0.00
Seat_Comfort               0.06
Seat_Class                 0.00
Arrival_Time_Convenient    9.46
Catering                   9.26
Platform_Location          0.03
Onboard_Wifi_Service       0.03
Onboard_Entertainment      0.02
Online_Support             0.10
Ease_of_Online_Booking     0.08
Onboard_Service            8.05
Legroom                    0.10
Baggage_Handling           0.15
CheckIn_Service            0.08
Cleanliness                0.01
Online_Boarding            0.01
Gender                     0.08
Customer_Type              9.48
Age                        0.03
Type_Travel                9.78
Travel_Class               0.00
Travel_Distance            0.00
Departure_Delay_in_Mins    0.06
Arrival_Delay_in_Mins      0.38
dtype: float64

Observations:

The columns with missing values worth noting are:

  • Arrival_Time_Convenient 9.46%
  • Catering 9.26%
  • Onboard_Service 8.05%
  • Customer_Type 9.48%
  • Type_Travel 9.78%

In the EDA, we will assess the potential impact of the missing values on the overall analysis and select an appropriate method to deal with them.
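One candidate treatment for the high-missingness categorical columns is most-frequent (mode) imputation via the `SimpleImputer` imported above; a minimal sketch on toy data, not necessarily the method ultimately chosen here:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column mimicking a categorical survey field with missing entries
toy = pd.DataFrame({'Catering': ['Good', np.nan, 'Good', 'Poor', np.nan]})

# Fill each missing value with the column's most frequent category
imputer = SimpleImputer(strategy='most_frequent')
toy['Catering'] = imputer.fit_transform(toy[['Catering']]).ravel()
print(toy['Catering'].tolist())  # ['Good', 'Good', 'Good', 'Poor', 'Good']
```

Mode imputation is simple but can bias distributions when, as here, close to 10% of a column is missing, which is why the choice is deferred to the EDA.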

In [246]:
# Checking the number of unique values in each column
df_train.nunique()
Out[246]:
ID                         94379
Overall_Experience             2
Seat_Comfort                   6
Seat_Class                     2
Arrival_Time_Convenient        6
Catering                       6
Platform_Location              6
Onboard_Wifi_Service           6
Onboard_Entertainment          6
Online_Support                 6
Ease_of_Online_Booking         6
Onboard_Service                6
Legroom                        6
Baggage_Handling               5
CheckIn_Service                6
Cleanliness                    6
Online_Boarding                6
Gender                         2
Customer_Type                  2
Age                           75
Type_Travel                    2
Travel_Class                   2
Travel_Distance             5210
Departure_Delay_in_Mins      437
Arrival_Delay_in_Mins        434
dtype: int64

Observations from the overview and sanity checks:

  • The merged dataset has 94379 rows and 25 columns.

  • There are missing values in the dataset.

  • The data types in the dataset are mainly object (19 columns, the categorical variables); the continuous variables are of type float (3) and int (3).

  • The ID column contains the unique identifier of each passenger. I will drop this column as it will not be useful for the purpose of the analysis.

  • The Age column contains 75 unique values. Since the aim of the project is to predict the overall experience of each passenger based on criteria including their specific age, I will train the models using the age variable as is. In other data science contexts, e.g., clustering, it would be appropriate to group the ages into bins, i.e., age categories, to aid model interpretation.

  • Similarly, I will not group the other continuous variables into bins, since the models I will be building work best with numeric values.

  • So far, the columns that will require transformation are those of 'object' data type.
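Because the survey responses are ordered ('Extremely Poor' up to 'Excellent'), one way to transform those object columns is an explicit ordinal mapping; a sketch with an assumed ordering and a hypothetical `rating_map`, not necessarily the encoding applied later:

```python
import pandas as pd

# Assumed ordinal ranking of the six-level survey responses
rating_order = ['Extremely Poor', 'Poor', 'Needs Improvement',
                'Acceptable', 'Good', 'Excellent']
rating_map = {label: i for i, label in enumerate(rating_order)}

# Mapping a toy response column to integers preserves the ordering
toy = pd.Series(['Good', 'Poor', 'Excellent'])
print(toy.map(rating_map).tolist())  # [4, 1, 5]
```

Ordinal mapping keeps one column per feature, whereas one-hot encoding (also imported above) would expand each six-level column into several binary columns and discard the ordering.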

In [30]:
# Making a list of all categorical variables in the train dataset
cat_cols=['Overall_Experience','Seat_Comfort','Seat_Class', 'Arrival_Time_Convenient','Catering','Platform_Location',
         'Onboard_Wifi_Service','Onboard_Entertainment','Online_Support','Ease_of_Online_Booking','Onboard_Service',
         'Legroom','Baggage_Handling','CheckIn_Service','Cleanliness','Online_Boarding','Gender','Customer_Type',
          'Type_Travel','Travel_Class']
In [31]:
# Converting the data type of each categorical variable to 'category'
for column in cat_cols:
    df_train[column]=df_train[column].astype('category')

df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94379 entries, 0 to 94378
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   ID                       94379 non-null  int64   
 1   Overall_Experience       94379 non-null  category
 2   Seat_Comfort             94318 non-null  category
 3   Seat_Class               94379 non-null  category
 4   Arrival_Time_Convenient  85449 non-null  category
 5   Catering                 85638 non-null  category
 6   Platform_Location        94349 non-null  category
 7   Onboard_Wifi_Service     94349 non-null  category
 8   Onboard_Entertainment    94361 non-null  category
 9   Online_Support           94288 non-null  category
 10  Ease_of_Online_Booking   94306 non-null  category
 11  Onboard_Service          86778 non-null  category
 12  Legroom                  94289 non-null  category
 13  Baggage_Handling         94237 non-null  category
 14  CheckIn_Service          94302 non-null  category
 15  Cleanliness              94373 non-null  category
 16  Online_Boarding          94373 non-null  category
 17  Gender                   94302 non-null  category
 18  Customer_Type            85428 non-null  category
 19  Age                      94346 non-null  float64 
 20  Type_Travel              85153 non-null  category
 21  Travel_Class             94379 non-null  category
 22  Travel_Distance          94379 non-null  int64   
 23  Departure_Delay_in_Mins  94322 non-null  float64 
 24  Arrival_Delay_in_Mins    94022 non-null  float64 
dtypes: category(20), float64(3), int64(2)
memory usage: 5.4 MB

Creating a pandas array of the numerical variables

In [33]:
# Creating the array of numerical columns excluding the ID variable
num_cols=['Age','Travel_Distance','Departure_Delay_in_Mins','Arrival_Delay_in_Mins']

Test data¶

In [35]:
# Creating the test dataset by merging the test survey data with the test travel data
df_test = pd.merge(df_survey_test, df_travel_test.drop_duplicates(['ID']), on="ID", how="left")
df_test.head()
Out[35]:
ID Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support Ease_of_Online_Booking ... Cleanliness Online_Boarding Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins
0 99900001 Acceptable Green Car Acceptable Acceptable Manageable Needs Improvement Excellent Good Excellent ... Excellent Poor Female NaN 36.0 Business Travel Business 532 0.0 0.0
1 99900002 Extremely Poor Ordinary Good Poor Manageable Acceptable Poor Acceptable Acceptable ... Excellent Acceptable Female Disloyal Customer 21.0 Business Travel Business 1425 9.0 28.0
2 99900003 Excellent Ordinary Excellent Excellent Very Convenient Excellent Excellent Excellent Needs Improvement ... Needs Improvement Excellent Male Loyal Customer 60.0 Business Travel Business 2832 0.0 0.0
3 99900004 Acceptable Green Car Excellent Acceptable Very Convenient Poor Acceptable Excellent Poor ... Excellent Poor Female Loyal Customer 29.0 Personal Travel Eco 1352 0.0 0.0
4 99900005 Excellent Ordinary Extremely Poor Excellent Needs Improvement Excellent Excellent Excellent Excellent ... Excellent Excellent Male Disloyal Customer 18.0 Business Travel Business 1610 17.0 0.0

5 rows × 24 columns

In [36]:
df_test.shape
Out[36]:
(35602, 24)
In [37]:
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35602 entries, 0 to 35601
Data columns (total 24 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       35602 non-null  int64  
 1   Seat_Comfort             35580 non-null  object 
 2   Seat_Class               35602 non-null  object 
 3   Arrival_Time_Convenient  32277 non-null  object 
 4   Catering                 32245 non-null  object 
 5   Platform_Location        35590 non-null  object 
 6   Onboard_Wifi_Service     35590 non-null  object 
 7   Onboard_Entertainment    35594 non-null  object 
 8   Online_Support           35576 non-null  object 
 9   Ease_of_Online_Booking   35584 non-null  object 
 10  Onboard_Service          32730 non-null  object 
 11  Legroom                  35577 non-null  object 
 12  Baggage_Handling         35562 non-null  object 
 13  CheckIn_Service          35580 non-null  object 
 14  Cleanliness              35600 non-null  object 
 15  Online_Boarding          35600 non-null  object 
 16  Gender                   35572 non-null  object 
 17  Customer_Type            32219 non-null  object 
 18  Age                      35591 non-null  float64
 19  Type_Travel              32154 non-null  object 
 20  Travel_Class             35602 non-null  object 
 21  Travel_Distance          35602 non-null  int64  
 22  Departure_Delay_in_Mins  35573 non-null  float64
 23  Arrival_Delay_in_Mins    35479 non-null  float64
dtypes: float64(3), int64(2), object(19)
memory usage: 6.5+ MB
In [38]:
df_test.isna().sum()
Out[38]:
ID                            0
Seat_Comfort                 22
Seat_Class                    0
Arrival_Time_Convenient    3325
Catering                   3357
Platform_Location            12
Onboard_Wifi_Service         12
Onboard_Entertainment         8
Online_Support               26
Ease_of_Online_Booking       18
Onboard_Service            2872
Legroom                      25
Baggage_Handling             40
CheckIn_Service              22
Cleanliness                   2
Online_Boarding               2
Gender                       30
Customer_Type              3383
Age                          11
Type_Travel                3448
Travel_Class                  0
Travel_Distance               0
Departure_Delay_in_Mins      29
Arrival_Delay_in_Mins       123
dtype: int64
In [39]:
# Making a list of all categorical variables in the test dataset
cat_col_test=['Seat_Comfort','Seat_Class', 'Arrival_Time_Convenient','Catering','Platform_Location',
         'Onboard_Wifi_Service','Onboard_Entertainment','Online_Support','Ease_of_Online_Booking','Onboard_Service',
         'Legroom','Baggage_Handling','CheckIn_Service','Cleanliness','Online_Boarding','Gender','Customer_Type',
          'Type_Travel','Travel_Class']
In [40]:
# Converting the data type of each categorical variable to 'category'
for column in cat_col_test:
    df_test[column]=df_test[column].astype('category')

df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35602 entries, 0 to 35601
Data columns (total 24 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   ID                       35602 non-null  int64   
 1   Seat_Comfort             35580 non-null  category
 2   Seat_Class               35602 non-null  category
 3   Arrival_Time_Convenient  32277 non-null  category
 4   Catering                 32245 non-null  category
 5   Platform_Location        35590 non-null  category
 6   Onboard_Wifi_Service     35590 non-null  category
 7   Onboard_Entertainment    35594 non-null  category
 8   Online_Support           35576 non-null  category
 9   Ease_of_Online_Booking   35584 non-null  category
 10  Onboard_Service          32730 non-null  category
 11  Legroom                  35577 non-null  category
 12  Baggage_Handling         35562 non-null  category
 13  CheckIn_Service          35580 non-null  category
 14  Cleanliness              35600 non-null  category
 15  Online_Boarding          35600 non-null  category
 16  Gender                   35572 non-null  category
 17  Customer_Type            32219 non-null  category
 18  Age                      35591 non-null  float64 
 19  Type_Travel              32154 non-null  category
 20  Travel_Class             35602 non-null  category
 21  Travel_Distance          35602 non-null  int64   
 22  Departure_Delay_in_Mins  35573 non-null  float64 
 23  Arrival_Delay_in_Mins    35479 non-null  float64 
dtypes: category(19), float64(3), int64(2)
memory usage: 2.0 MB
In [41]:
# Creating copies of the train and test datasets as backups 
data_train = df_train.copy()
data_test = df_test.copy()
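`DataFrame.copy()` is a deep copy by default, so later transformations on `df_train` and `df_test` will not alter these backups; a quick illustration on toy data:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = a.copy()           # deep copy by default (copy(deep=True))
a.loc[0, 'x'] = 99     # mutating the original...
print(b.loc[0, 'x'])   # 1 -> ...leaves the backup untouched
```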

3. Conducting exploratory data analysis - a. Univariate, b. Bi / Multi-variate, c. Answering questions about particular variables of interest¶

Approach to EDA:

  • Viewing the statistical summaries of the dataset
  • Using the lists of continuous and categorical variables created above
  • Univariate analysis
  • Bi / Multi-variate analysis
  • Observations and providing answers to key business questions
In [43]:
# Checking the summary statistics of the columns with continuous observations
df_train.describe().T
Out[43]:
count mean std min 25% 50% 75% max
ID 94379.0 9.884719e+07 27245.014865 98800001.0 98823595.5 98847190.0 98870784.5 98894379.0
Age 94346.0 3.941965e+01 15.116632 7.0 27.0 40.0 51.0 85.0
Travel_Distance 94379.0 1.978888e+03 1027.961019 50.0 1359.0 1923.0 2538.0 6951.0
Departure_Delay_in_Mins 94322.0 1.464709e+01 38.138781 0.0 0.0 0.0 12.0 1592.0
Arrival_Delay_in_Mins 94022.0 1.500522e+01 38.439409 0.0 0.0 0.0 13.0 1584.0

Observations:¶

The statistical summary shows:

  • Age has a median (50th percentile) of 40 years; the minimum and maximum are 7 and 85, respectively.

  • Travel_Distance ranges from a minimum of 50 to a maximum of 6,951.

  • 75% of departures were delayed by at most 12 minutes.

  • 75% of arrivals were delayed by at most 13 minutes.
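To be precise about the delay figures: the 12- and 13-minute values are 75th percentiles, i.e., three quarters of trains were delayed by at most that much. A toy illustration of pandas' `quantile`:

```python
import pandas as pd

# The 75th percentile is the value at or below which 75% of observations fall
delays = pd.Series([0, 0, 6, 12, 40])
print(delays.quantile(0.75))  # 12.0
```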

In [45]:
# Checking the summary of categorical variables 
df_train.describe(exclude = 'number').T
Out[45]:
count unique top freq
Overall_Experience 94379 2 1 51593
Seat_Comfort 94318 6 Acceptable 21158
Seat_Class 94379 2 Green Car 47435
Arrival_Time_Convenient 85449 6 Good 19574
Catering 85638 6 Acceptable 18468
Platform_Location 94349 6 Manageable 24173
Onboard_Wifi_Service 94349 6 Good 22835
Onboard_Entertainment 94361 6 Good 30446
Online_Support 94288 6 Good 30016
Ease_of_Online_Booking 94306 6 Good 28909
Onboard_Service 86778 6 Good 27265
Legroom 94289 6 Good 28870
Baggage_Handling 94237 5 Good 34944
CheckIn_Service 94302 6 Good 26502
Cleanliness 94373 6 Good 35427
Online_Boarding 94373 6 Good 25533
Gender 94302 2 Female 47815
Customer_Type 85428 2 Loyal Customer 69823
Type_Travel 85153 2 Business Travel 58617
Travel_Class 94379 2 Eco 49342

Univariate analysis

Continuous data

In [48]:
# Creating the histograms
df_train[num_cols].hist(figsize=(10,8))
plt.show()
[Figure: histograms of Age, Travel_Distance, Departure_Delay_in_Mins and Arrival_Delay_in_Mins]

Observations:

  • The Age variable tends towards a normal distribution, with the majority of passengers falling between 20 and 60 years

  • Travel_Distance is right-skewed, indicating that the majority of passengers travel shorter distances than the mean

  • Departure and arrival delays are similarly positively skewed
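The direction of skew can be confirmed numerically with pandas' `skew()`, where a positive value indicates a right-skewed distribution; a toy sketch:

```python
import pandas as pd

# A right-skewed toy sample: a few large delays stretch the right tail
delays = pd.Series([0, 0, 0, 2, 5, 12, 90, 240])
print(delays.skew() > 0)  # True -> positive (right) skew
```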

In [50]:
# Defining the hist_box() function that plots a boxplot and histogram in one visual. 
def hist_box(df, col):
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={'height_ratios': (0.15, 0.85)})
    # Adding a graph in each part
    sns.boxplot(data=df, x=col, ax=ax_box, showmeans=True)
    sns.histplot(data=df, x=col, kde=True, ax=ax_hist)
    plt.show()

1. Age¶

In [52]:
hist_box(df_train,'Age')
[Figure: boxplot and histogram of Age]
In [53]:
hist_box(df_train,'Travel_Distance')
[Figure: boxplot and histogram of Travel_Distance]
In [54]:
# Checking outliers in the Travel Distance column 
df_train[df_train['Travel_Distance']>4300]
Out[54]:
ID Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support ... Cleanliness Online_Boarding Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins
51 98800052 0 Poor Ordinary Extremely Poor Extremely Poor Manageable Poor Extremely Poor Poor ... Good Poor Female Loyal Customer 26.0 Business Travel Business 4560 0.0 7.0
79 98800080 0 Needs Improvement Green Car Acceptable Poor Manageable Needs Improvement Needs Improvement Needs Improvement ... Needs Improvement Needs Improvement Male Loyal Customer 25.0 NaN Business 5406 0.0 0.0
112 98800113 1 Good Ordinary Good Good Convenient Good Good Good ... Excellent Good Male Loyal Customer 26.0 Business Travel Business 4615 17.0 6.0
115 98800116 0 Acceptable Green Car Poor Poor Inconvenient Acceptable Acceptable Acceptable ... Good Acceptable Male Loyal Customer 22.0 Business Travel Business 4733 0.0 2.0
133 98800134 0 Needs Improvement Green Car Good Good Convenient Needs Improvement Needs Improvement Needs Improvement ... Acceptable Needs Improvement Female Loyal Customer 24.0 NaN Business 5135 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
94081 98894082 1 Good Ordinary Good Good Convenient Excellent Good Excellent ... Good Excellent Male Loyal Customer 22.0 Business Travel Business 4439 0.0 0.0
94149 98894150 0 Needs Improvement Ordinary Excellent Needs Improvement Convenient Needs Improvement Excellent Needs Improvement ... Acceptable Needs Improvement Female Loyal Customer 25.0 Personal Travel Eco 6655 0.0 0.0
94170 98894171 0 Poor Green Car Good Good Convenient Poor Needs Improvement Poor ... Acceptable Poor Male Loyal Customer 31.0 Business Travel Business 4617 68.0 56.0
94295 98894296 0 Needs Improvement Green Car Needs Improvement Needs Improvement Needs Improvement Needs Improvement Needs Improvement Needs Improvement ... Good Needs Improvement Male Loyal Customer 30.0 Business Travel Business 4927 26.0 25.0
94353 98894354 1 Acceptable Green Car Acceptable Acceptable Manageable Excellent Excellent Good ... Excellent Excellent Female Loyal Customer 28.0 Business Travel Business 4645 0.0 2.0

1938 rows × 25 columns

In [55]:
hist_box(df_train,'Departure_Delay_in_Mins')
In [56]:
# Checking outliers in the Departure Delay column 
df_train[df_train['Departure_Delay_in_Mins']>20]
Out[56]:
ID Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support ... Cleanliness Online_Boarding Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins
2 98800003 1 Needs Improvement Green Car Needs Improvement Needs Improvement Needs Improvement Needs Improvement Good Excellent ... Excellent Excellent Female Loyal Customer 43.0 Business Travel Business 1061 77.0 119.0
14 98800015 0 Acceptable Ordinary Poor Poor Inconvenient Acceptable Acceptable Acceptable ... Needs Improvement Acceptable Male Loyal Customer 33.0 Business Travel Business 1180 49.0 49.0
19 98800020 1 Excellent Green Car Good Good Manageable Good Good Good ... Excellent Good Male Disloyal Customer 24.0 Business Travel Eco 1994 22.0 85.0
30 98800031 0 Acceptable Green Car Acceptable Acceptable Manageable Good Acceptable Excellent ... Good Good Male Loyal Customer 9.0 NaN Eco 2379 100.0 93.0
33 98800034 1 Excellent Ordinary NaN Excellent Needs Improvement Poor Excellent Poor ... Good Poor Male Disloyal Customer 22.0 Business Travel Business 2515 42.0 30.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
94354 98894355 1 Needs Improvement Green Car Good Needs Improvement Needs Improvement Needs Improvement Good Good ... Excellent Good Male Loyal Customer 48.0 Business Travel Business 2179 65.0 54.0
94359 98894360 1 Acceptable Green Car Acceptable Acceptable Manageable Good Excellent Good ... Good Excellent Female Loyal Customer 39.0 Business Travel Business 2418 24.0 22.0
94367 98894368 0 Acceptable Ordinary Good Acceptable Inconvenient Acceptable Needs Improvement Needs Improvement ... Needs Improvement Needs Improvement Male Loyal Customer 14.0 Personal Travel Business 2842 142.0 141.0
94374 98894375 0 Poor Ordinary Good Good Convenient Poor Poor Poor ... Good Poor Male Loyal Customer 32.0 Business Travel Business 1357 83.0 125.0
94378 98894379 0 Acceptable Ordinary Poor Acceptable Manageable Acceptable Acceptable Acceptable ... Good Acceptable Male Loyal Customer 54.0 NaN Eco 2107 28.0 28.0

17655 rows × 25 columns

In [57]:
hist_box(df_train,'Arrival_Delay_in_Mins')
In [58]:
# Checking outliers in the Arrival Delay column
df_train[df_train['Arrival_Delay_in_Mins']>20]
Out[58]:
ID Overall_Experience Seat_Comfort Seat_Class Arrival_Time_Convenient Catering Platform_Location Onboard_Wifi_Service Onboard_Entertainment Online_Support ... Cleanliness Online_Boarding Gender Customer_Type Age Type_Travel Travel_Class Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins
2 98800003 1 Needs Improvement Green Car Needs Improvement Needs Improvement Needs Improvement Needs Improvement Good Excellent ... Excellent Excellent Female Loyal Customer 43.0 Business Travel Business 1061 77.0 119.0
13 98800014 0 Good Ordinary Good Good Manageable Good Excellent NaN ... Acceptable Good Female Loyal Customer 47.0 Personal Travel Eco 1100 20.0 34.0
14 98800015 0 Acceptable Ordinary Poor Poor Inconvenient Acceptable Acceptable Acceptable ... Needs Improvement Acceptable Male Loyal Customer 33.0 Business Travel Business 1180 49.0 49.0
17 98800018 1 Excellent Green Car Excellent Excellent Needs Improvement Excellent Excellent Excellent ... Excellent Excellent Male Loyal Customer 68.0 Personal Travel Eco 3756 20.0 52.0
19 98800020 1 Excellent Green Car Good Good Manageable Good Good Good ... Excellent Good Male Disloyal Customer 24.0 Business Travel Eco 1994 22.0 85.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
94359 98894360 1 Acceptable Green Car Acceptable Acceptable Manageable Good Excellent Good ... Good Excellent Female Loyal Customer 39.0 Business Travel Business 2418 24.0 22.0
94367 98894368 0 Acceptable Ordinary Good Acceptable Inconvenient Acceptable Needs Improvement Needs Improvement ... Needs Improvement Needs Improvement Male Loyal Customer 14.0 Personal Travel Business 2842 142.0 141.0
94371 98894372 0 Poor Ordinary Poor Poor Inconvenient Good Good Acceptable ... Poor Poor Female Loyal Customer 58.0 Business Travel Business 502 0.0 30.0
94374 98894375 0 Poor Ordinary Good Good Convenient Poor Poor Poor ... Good Poor Male Loyal Customer 32.0 Business Travel Business 1357 83.0 125.0
94378 98894379 0 Acceptable Ordinary Poor Acceptable Manageable Acceptable Acceptable Acceptable ... Good Acceptable Male Loyal Customer 54.0 NaN Eco 2107 28.0 28.0

18007 rows × 25 columns

Observations:

  • The mean and median age of the passengers are around 40 years.

  • The mean and median travel distance are just below 2000, with the mean higher than the median. There is a significant proportion of outliers in travel distance, the longest being around 7000.

  • The mean and median departure delays and arrival delays are each roughly equal, however both variables have significant outliers that need further investigation to understand their contribution to the overall customer experience.
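The outlier fences drawn by the boxplots above follow the conventional 1.5×IQR rule, which can also be computed directly. A minimal sketch on synthetic right-skewed data (a stand-in for a column like Travel_Distance, not the actual dataset):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for Travel_Distance (illustrative only)
rng = np.random.default_rng(0)
distance = pd.Series(rng.gamma(shape=2.0, scale=900.0, size=1000))

# 1.5*IQR rule: values beyond Q3 + 1.5*IQR are flagged as outliers
q1, q3 = distance.quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr
outliers = distance[distance > upper]
print(f"upper fence: {upper:.0f}, outliers flagged: {len(outliers)}")
```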

Categorical data

In [61]:
# Setting the figure size for plots generated with seaborn
import seaborn as sns
sns.set(rc={"figure.figsize": (10,4)})  # width = 10, height = 4
sns.set_palette("tab10")
In [62]:
# Overall Experience
sns.countplot(x = df_train['Overall_Experience'])
plt.show()
In [63]:
# Seat_Comfort
sns.countplot(x = df_train['Seat_Comfort'])
plt.show()
In [64]:
# Seat_Class
sns.countplot(x = df_train['Seat_Class'])
plt.show()
In [248]:
# Arrival_Time_Convenient
sns.countplot(x = df_train['Arrival_Time_Convenient'])
plt.show()
In [66]:
# Catering
sns.countplot(x = df_train['Catering'])
plt.show()
In [67]:
# Platform_Location
sns.countplot(x = df_train['Platform_Location'])
plt.show()
In [68]:
# Onboard_Wifi_Service
sns.countplot(x = df_train['Onboard_Wifi_Service'])
plt.show()
In [69]:
# Onboard_Entertainment
sns.countplot(x = df_train['Onboard_Entertainment'])
plt.show()
In [70]:
# Online_Support
sns.countplot(x = df_train['Online_Support'])
plt.show()
In [71]:
# Ease_of_Online_Booking
sns.countplot(x = df_train['Ease_of_Online_Booking'])
plt.show()
In [72]:
# Onboard_Service
sns.countplot(x = df_train['Onboard_Service'])
plt.show()
In [73]:
# Legroom
sns.countplot(x = df_train['Legroom'])
plt.show()
In [74]:
# Baggage_Handling
sns.countplot(x = df_train['Baggage_Handling'])
plt.show()
In [75]:
# CheckIn_Service
sns.countplot(x = df_train['CheckIn_Service'])
plt.show()
In [76]:
# Cleanliness
sns.countplot(x = df_train['Cleanliness'])
plt.show()
In [77]:
# Online_Boarding
sns.countplot(x = df_train['Online_Boarding'])
plt.show()
In [78]:
# Gender
sns.countplot(x = df_train['Gender'])
plt.show()
In [79]:
# Customer_Type
sns.countplot(x = df_train['Customer_Type'])
plt.show()
In [80]:
# Type_Travel
sns.countplot(x = df_train['Type_Travel'])
plt.show()
In [81]:
# Travel_Class
sns.countplot(x = df_train['Travel_Class'])
plt.show()

Observations:

  • Overall_Experience: The variable of interest has two moderately imbalanced classes. Satisfied customers are in the majority.
  • Seat_Comfort: Ratings of 'poor', 'acceptable', 'needs improvement' and 'good' each had around 15,000 or more observations. 'Excellent' had approximately 13,000 and 'extremely poor' around 3,000, indicating a wide distribution in the variable.
  • Seat_Class: Seat classes were equally distributed.
  • Arrival_Time_Convenient: There is a wide distribution of reviews for arrival time convenience. Notably, the count of 'poor' ratings is close to those of 'acceptable' and 'needs improvement'.
  • Catering: There is a wide distribution of ratings for catering, the majority of them neutral to positive, i.e., from 'acceptable' upward.
  • Platform_Location: Platform location shows a roughly normal distribution of ratings. Observations of 'very inconvenient' are very rare.
  • Onboard_Wifi_Service: The onboard wifi service ratings are fairly evenly distributed, except for the roughly 10% of customers who rated it 'poor'.
  • Onboard_Entertainment: There was a wide and unequal distribution in the reviews for onboard entertainment. Altogether, at least 50% of customers rated it 'good' or 'excellent'.
  • Online_Support: As with onboard entertainment, sentiment toward online support was mostly positive; however, around a third of the ratings fell in the 'acceptable' or more negative categories.
  • Ease_of_Online_Booking: The distribution of ratings for ease of online booking was similar to that of online support.
  • Onboard_Service: The distribution of ratings for onboard service was similar to that of ease of online booking.
  • Legroom: The distribution of ratings for legroom was similar to that of onboard service.
  • Baggage_Handling: The distribution of customer sentiment for baggage handling was skewed toward positive ratings.
  • CheckIn_Service: Check-in service ratings were mostly positive, but ratings of 'acceptable' and lower made up just under half of total observations.
  • Cleanliness: The distribution of ratings for cleanliness was wide but mostly positive.
  • Online_Boarding: The distribution of ratings for online boarding was wide and mostly positive.
  • Gender: Gender was nearly equally distributed.
  • Customer_Type: There was a quite unequal distribution in customer type, with 'loyal customers' in the majority.
  • Type_Travel: Around two-thirds of customers travelled for business rather than personal reasons.
  • Travel_Class: Travel class was nearly equally distributed between business and eco.

Bivariate and Multivariate analysis¶

In [84]:
# Finding and visualising the correlation between the numerical variables using a heatmap
# Plotting the correlation between numerical variables
plt.figure(figsize=(15,8))
sns.heatmap(df_train[num_cols].corr(),annot=True, fmt='0.2f', cmap='YlGnBu');
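The pattern the heatmap reveals, i.e., two delay columns that share a common cause correlate strongly while unrelated columns do not, can be reproduced on a small synthetic frame (illustrative data only, not the hackathon columns):

```python
import numpy as np
import pandas as pd

# Arrival delay tracks departure delay plus noise; age is independent
rng = np.random.default_rng(1)
dep = rng.exponential(scale=15.0, size=500)
df = pd.DataFrame({
    "Departure_Delay_in_Mins": dep,
    "Arrival_Delay_in_Mins": dep + rng.normal(0, 3.0, size=500),
    "Age": rng.integers(7, 85, size=500),
})

# Pearson correlation matrix, as plotted in the heatmap above
corr = df.corr()
print(corr.round(2))
```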

Checking the relationship between customer satisfaction, i.e., overall experience, and the numerical variables

In [86]:
# Mean of numerical variables grouped by status
df_train.groupby(['Overall_Experience'])[num_cols].mean()
Out[86]:
Age Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins
Overall_Experience
0 37.49018 2025.826088 17.738600 18.392374
1 41.01968 1939.962650 12.083107 12.196763

Observations:¶

  • Except for the delays in arrivals and departures, the heatmap does not show any significant correlation between the predictor variables. As a late departure would naturally cause a delayed arrival, that correlation is to be expected.
  • The average age of customers was about 37 in class 0 and 41 in class 1.
  • There was a difference of roughly 86 between the average distances traveled by class 0 and class 1 customers.
  • Average departure and arrival delays were roughly 5-6 minutes longer for class 0 than for class 1 customers.
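The group-wise comparison above is a plain groupby-and-mean; a toy sketch with hypothetical rows:

```python
import pandas as pd

# Hypothetical rows mimicking the shape of the train data (illustrative only)
toy = pd.DataFrame({
    "Overall_Experience": [0, 0, 1, 1, 1],
    "Age": [30, 40, 40, 42, 44],
    "Travel_Distance": [2100, 1900, 2000, 1800, 1900],
})

# Mean of each numeric column within each target class
means = toy.groupby("Overall_Experience")[["Age", "Travel_Distance"]].mean()
print(means)
```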
In [88]:
# Let us plot the categorical variables. 
for i in cat_cols:
    if i!='Overall_Experience':
        (pd.crosstab(df_train[i],df_train['Overall_Experience'],normalize='index')*100).plot(kind='bar',figsize=(10,4),stacked=True)
        plt.ylabel('Overall Experience %')

Observations:¶

  • Seat_Comfort: For class 0, i.e., the dissatisfied customers, overall experience was less related to seat comfort compared to satisfied customers. Interestingly, the distributions of 'excellent' and 'extremely poor' ratings were equal for satisfied customers.
  • Seat_Class: For both the green car and ordinary seat classes, 40% of customers were of class 0 and 60% of class 1.
  • Arrival_Time_Convenient: Between class 0 and class 1, reviews for arrival time convenience were split just below half across the categories. The parameter showed a positive influence on overall experience.
  • Catering: For class 0, catering was more negatively associated with overall experience compared to class 1.
  • Platform_Location: For class 0 customers, platform location had varying ratings, none being strongly related to overall experience. For class 1, on the other hand, a high count of 'very inconvenient' was observed despite the customers being satisfied overall.
  • Onboard_Wifi_Service: The onboard wifi service rating was more positively associated with class 1 than class 0.
  • Onboard_Entertainment: Class 0 had high counts of negative and neutral sentiment for onboard entertainment in relation to overall experience. Class 1 had mixed ratings, with high counts of 'excellent' and 'good' on one hand and 'extremely poor' on the other.
  • Online_Support: For class 0, overall experience was strongly influenced by online support in the negative direction, whereas for class 1 the influence on overall experience was more positive.
  • Ease_of_Online_Booking: Ease of online booking was more influential on dissatisfaction, i.e., class 0 customers, with class 1 customers having a more positive overall experience given the parameter.
  • Onboard_Service: For class 0, onboard service was more associated with negative sentiment in terms of overall experience compared to class 1.
  • Legroom: For class 0, ratings for legroom were mostly neutral to negative in terms of overall experience. For class 1, the sentiments varied widely, e.g., similar counts of 'excellent' and 'extremely poor' were observed.
  • Baggage_Handling: The overall experience for class 0 was neutral to negative in terms of baggage handling. For class 1, the variable had a positive influence on overall experience.
  • CheckIn_Service: The influence of check-in service on overall experience was largely negative for class 0 and largely positive for class 1.
  • Cleanliness: Similarly to check-in service, the overall experience of class 0 was more negatively influenced by cleanliness compared to class 1.
  • Online_Boarding: As with cleanliness, the distribution of ratings was wide and mostly positive.
  • Gender: Class 0 consisted of more males than females, i.e., females had a more positive overall experience.
  • Customer_Type: Class 0 comprised more disloyal customers compared to class 1, i.e., loyal customers had a more positive overall experience.
  • Type_Travel: There was a positive relationship between business travel and overall experience compared to personal travel.
  • Travel_Class: As with type of travel, the 'business' travel class was more positively linked with overall experience than the 'eco' class.

Model Building Approach:¶

  • Data preparation
  • Partition the data into a train and test set
  • Build a model on the train data
  • Tune the model if required
  • Test the data on the test set

Data Preparation¶

Defining the predictor variables (X) and the target variable (Y)¶

The datasets provided were already split into train and test sets using a stratified 70%/30% split. Stratification ensures that relative class frequencies are approximately preserved in the train and test sets.
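Although the hackathon files arrived pre-split, a stratified 70/30 split of this kind can be reproduced with scikit-learn's train_test_split; the labels below are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 70% satisfied (1), 30% not (0)
y = np.array([1] * 70 + [0] * 30)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder features

# stratify=y preserves the 70/30 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(y_tr.mean(), y_te.mean())        # class ratios match the full data
```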

In [93]:
# Separating the dependent variable and other variables on the train set
X_train=df_train.drop(columns='Overall_Experience')
Y_train=df_train['Overall_Experience']
In [94]:
# Separating the dependent variable and other variables on the test set
X_test=df_test
# Y_test=['Overall_Experience'] 

Missing Values¶

Earlier we identified that our data has missing values. We will impute them using the median for continuous variables and the mode for categorical variables, with SimpleImputer.

The SimpleImputer provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median, or most frequent) of each column in which the missing values are located.
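A minimal sketch of the fit-on-train, transform-on-test pattern used below, on a toy frame with hypothetical values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames: the imputer learns the median from train and reuses it on test
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 50.0]})
test = pd.DataFrame({"Age": [np.nan, 25.0]})

si = SimpleImputer(strategy="median")
train[["Age"]] = si.fit_transform(train[["Age"]])   # median of 20, 30, 50 is 30
test[["Age"]] = si.transform(test[["Age"]])         # NaN -> 30 (train median)
print(train["Age"].tolist(), test["Age"].tolist())
```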

In [96]:
X_train.isna().sum()
Out[96]:
ID                            0
Seat_Comfort                 61
Seat_Class                    0
Arrival_Time_Convenient    8930
Catering                   8741
Platform_Location            30
Onboard_Wifi_Service         30
Onboard_Entertainment        18
Online_Support               91
Ease_of_Online_Booking       73
Onboard_Service            7601
Legroom                      90
Baggage_Handling            142
CheckIn_Service              77
Cleanliness                   6
Online_Boarding               6
Gender                       77
Customer_Type              8951
Age                          33
Type_Travel                9226
Travel_Class                  0
Travel_Distance               0
Departure_Delay_in_Mins      57
Arrival_Delay_in_Mins       357
dtype: int64

Imputing missing data¶

In [98]:
si1=SimpleImputer(strategy='median')

median_imputed_col=['Age','Departure_Delay_in_Mins','Arrival_Delay_in_Mins']

# Fit and transform the train data
X_train[median_imputed_col]=si1.fit_transform(X_train[median_imputed_col])

#Transform the test data i.e. replace missing values with the median calculated using training data
X_test[median_imputed_col]=si1.transform(X_test[median_imputed_col])
In [99]:
si2=SimpleImputer(strategy='most_frequent')

mode_imputed_col=['Seat_Comfort','Arrival_Time_Convenient','Catering','Platform_Location',
                  'Onboard_Wifi_Service','Onboard_Entertainment','Online_Support','Ease_of_Online_Booking',
                  'Onboard_Service','Legroom','Baggage_Handling','CheckIn_Service','Cleanliness',
                  'Online_Boarding','Gender','Customer_Type','Type_Travel']

# Fit and transform the train data
X_train[mode_imputed_col]=si2.fit_transform(X_train[mode_imputed_col])

# Transform the test data i.e. replace missing values with the mode calculated using training data
X_test[mode_imputed_col]=si2.transform(X_test[mode_imputed_col])
In [100]:
# Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print('-'*30)
print(X_test.isna().sum())
ID                         0
Seat_Comfort               0
Seat_Class                 0
Arrival_Time_Convenient    0
Catering                   0
Platform_Location          0
Onboard_Wifi_Service       0
Onboard_Entertainment      0
Online_Support             0
Ease_of_Online_Booking     0
Onboard_Service            0
Legroom                    0
Baggage_Handling           0
CheckIn_Service            0
Cleanliness                0
Online_Boarding            0
Gender                     0
Customer_Type              0
Age                        0
Type_Travel                0
Travel_Class               0
Travel_Distance            0
Departure_Delay_in_Mins    0
Arrival_Delay_in_Mins      0
dtype: int64
------------------------------
ID                         0
Seat_Comfort               0
Seat_Class                 0
Arrival_Time_Convenient    0
Catering                   0
Platform_Location          0
Onboard_Wifi_Service       0
Onboard_Entertainment      0
Online_Support             0
Ease_of_Online_Booking     0
Onboard_Service            0
Legroom                    0
Baggage_Handling           0
CheckIn_Service            0
Cleanliness                0
Online_Boarding            0
Gender                     0
Customer_Type              0
Age                        0
Type_Travel                0
Travel_Class               0
Travel_Distance            0
Departure_Delay_in_Mins    0
Arrival_Delay_in_Mins      0
dtype: int64

Observations:

  • After imputing the missing data, there are no longer any missing values
  • One-hot encoding: since several categorical variables contain string values, we will create dummy variables to continue with modelling.
In [102]:
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_test.shape)
(94379, 79) (35602, 74)

Observations:

  • After encoding there are 79 columns in the train data set and 74 columns in the test data set.

Next we will compare the dummy columns in the train and test datasets to see where they differ.
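As an aside, pandas' reindex offers another way to reconcile mismatched dummy columns, filling absent levels with 0 rather than dropping them; the column names here are hypothetical:

```python
import pandas as pd

# Hypothetical mismatch after get_dummies: the train frame has a rare level
# ('Extremely Poor') that never appeared in the test data
train = pd.DataFrame({"Svc_Good": [1, 0], "Svc_Poor": [0, 1], "Svc_Extremely Poor": [0, 0]})
test = pd.DataFrame({"Svc_Good": [1], "Svc_Poor": [0]})

# reindex fills the missing dummy with 0 so both frames share one column order
test_aligned = test.reindex(columns=train.columns, fill_value=0)
print(list(test_aligned.columns))
```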

In [105]:
# Checking the index numbers of the columns and their names in the train dataframe and their order
for idx, column_name in enumerate(X_train.columns):
    print(f"Column {idx}: {column_name}")
Column 0: ID
Column 1: Age
Column 2: Travel_Distance
Column 3: Departure_Delay_in_Mins
Column 4: Arrival_Delay_in_Mins
Column 5: Seat_Comfort_Excellent
Column 6: Seat_Comfort_Extremely Poor
Column 7: Seat_Comfort_Good
Column 8: Seat_Comfort_Needs Improvement
Column 9: Seat_Comfort_Poor
Column 10: Seat_Class_Ordinary
Column 11: Arrival_Time_Convenient_Excellent
Column 12: Arrival_Time_Convenient_Extremely Poor
Column 13: Arrival_Time_Convenient_Good
Column 14: Arrival_Time_Convenient_Needs Improvement
Column 15: Arrival_Time_Convenient_Poor
Column 16: Catering_Excellent
Column 17: Catering_Extremely Poor
Column 18: Catering_Good
Column 19: Catering_Needs Improvement
Column 20: Catering_Poor
Column 21: Platform_Location_Inconvenient
Column 22: Platform_Location_Manageable
Column 23: Platform_Location_Needs Improvement
Column 24: Platform_Location_Very Convenient
Column 25: Platform_Location_Very Inconvenient
Column 26: Onboard_Wifi_Service_Excellent
Column 27: Onboard_Wifi_Service_Extremely Poor
Column 28: Onboard_Wifi_Service_Good
Column 29: Onboard_Wifi_Service_Needs Improvement
Column 30: Onboard_Wifi_Service_Poor
Column 31: Onboard_Entertainment_Excellent
Column 32: Onboard_Entertainment_Extremely Poor
Column 33: Onboard_Entertainment_Good
Column 34: Onboard_Entertainment_Needs Improvement
Column 35: Onboard_Entertainment_Poor
Column 36: Online_Support_Excellent
Column 37: Online_Support_Extremely Poor
Column 38: Online_Support_Good
Column 39: Online_Support_Needs Improvement
Column 40: Online_Support_Poor
Column 41: Ease_of_Online_Booking_Excellent
Column 42: Ease_of_Online_Booking_Extremely Poor
Column 43: Ease_of_Online_Booking_Good
Column 44: Ease_of_Online_Booking_Needs Improvement
Column 45: Ease_of_Online_Booking_Poor
Column 46: Onboard_Service_Excellent
Column 47: Onboard_Service_Extremely Poor
Column 48: Onboard_Service_Good
Column 49: Onboard_Service_Needs Improvement
Column 50: Onboard_Service_Poor
Column 51: Legroom_Excellent
Column 52: Legroom_Extremely Poor
Column 53: Legroom_Good
Column 54: Legroom_Needs Improvement
Column 55: Legroom_Poor
Column 56: Baggage_Handling_Excellent
Column 57: Baggage_Handling_Good
Column 58: Baggage_Handling_Needs Improvement
Column 59: Baggage_Handling_Poor
Column 60: CheckIn_Service_Excellent
Column 61: CheckIn_Service_Extremely Poor
Column 62: CheckIn_Service_Good
Column 63: CheckIn_Service_Needs Improvement
Column 64: CheckIn_Service_Poor
Column 65: Cleanliness_Excellent
Column 66: Cleanliness_Extremely Poor
Column 67: Cleanliness_Good
Column 68: Cleanliness_Needs Improvement
Column 69: Cleanliness_Poor
Column 70: Online_Boarding_Excellent
Column 71: Online_Boarding_Extremely Poor
Column 72: Online_Boarding_Good
Column 73: Online_Boarding_Needs Improvement
Column 74: Online_Boarding_Poor
Column 75: Gender_Male
Column 76: Customer_Type_Loyal Customer
Column 77: Type_Travel_Personal Travel
Column 78: Travel_Class_Eco
In [106]:
# Checking the index numbers of the columns and their names in the train dataframe and their order
for idx, column_name in enumerate(X_test.columns):
    print(f"Column {idx}: {column_name}")
Column 0: ID
Column 1: Age
Column 2: Travel_Distance
Column 3: Departure_Delay_in_Mins
Column 4: Arrival_Delay_in_Mins
Column 5: Seat_Comfort_Excellent
Column 6: Seat_Comfort_Extremely Poor
Column 7: Seat_Comfort_Good
Column 8: Seat_Comfort_Needs Improvement
Column 9: Seat_Comfort_Poor
Column 10: Seat_Class_Ordinary
Column 11: Arrival_Time_Convenient_Excellent
Column 12: Arrival_Time_Convenient_Extremely Poor
Column 13: Arrival_Time_Convenient_Good
Column 14: Arrival_Time_Convenient_Needs Improvement
Column 15: Arrival_Time_Convenient_Poor
Column 16: Catering_Excellent
Column 17: Catering_Extremely Poor
Column 18: Catering_Good
Column 19: Catering_Needs Improvement
Column 20: Catering_Poor
Column 21: Platform_Location_Inconvenient
Column 22: Platform_Location_Manageable
Column 23: Platform_Location_Needs Improvement
Column 24: Platform_Location_Very Convenient
Column 25: Onboard_Wifi_Service_Excellent
Column 26: Onboard_Wifi_Service_Extremely Poor
Column 27: Onboard_Wifi_Service_Good
Column 28: Onboard_Wifi_Service_Needs Improvement
Column 29: Onboard_Wifi_Service_Poor
Column 30: Onboard_Entertainment_Excellent
Column 31: Onboard_Entertainment_Extremely Poor
Column 32: Onboard_Entertainment_Good
Column 33: Onboard_Entertainment_Needs Improvement
Column 34: Onboard_Entertainment_Poor
Column 35: Online_Support_Excellent
Column 36: Online_Support_Good
Column 37: Online_Support_Needs Improvement
Column 38: Online_Support_Poor
Column 39: Ease_of_Online_Booking_Excellent
Column 40: Ease_of_Online_Booking_Extremely Poor
Column 41: Ease_of_Online_Booking_Good
Column 42: Ease_of_Online_Booking_Needs Improvement
Column 43: Ease_of_Online_Booking_Poor
Column 44: Onboard_Service_Excellent
Column 45: Onboard_Service_Good
Column 46: Onboard_Service_Needs Improvement
Column 47: Onboard_Service_Poor
Column 48: Legroom_Excellent
Column 49: Legroom_Extremely Poor
Column 50: Legroom_Good
Column 51: Legroom_Needs Improvement
Column 52: Legroom_Poor
Column 53: Baggage_Handling_Excellent
Column 54: Baggage_Handling_Good
Column 55: Baggage_Handling_Needs Improvement
Column 56: Baggage_Handling_Poor
Column 57: CheckIn_Service_Excellent
Column 58: CheckIn_Service_Good
Column 59: CheckIn_Service_Needs Improvement
Column 60: CheckIn_Service_Poor
Column 61: Cleanliness_Excellent
Column 62: Cleanliness_Good
Column 63: Cleanliness_Needs Improvement
Column 64: Cleanliness_Poor
Column 65: Online_Boarding_Excellent
Column 66: Online_Boarding_Extremely Poor
Column 67: Online_Boarding_Good
Column 68: Online_Boarding_Needs Improvement
Column 69: Online_Boarding_Poor
Column 70: Gender_Male
Column 71: Customer_Type_Loyal Customer
Column 72: Type_Travel_Personal Travel
Column 73: Travel_Class_Eco
In [107]:
# Cross-checking the columns
common_columns = X_train.columns.isin(X_test.columns).sum()
print(f"Number of common columns: {common_columns}")
Number of common columns: 74

There are 74 columns common to the train and test sets. Next we will check for discrepancies in the feature names.

Feature names in the train set and missing in the test set:¶

print(X_train['CheckIn_Service_Extremely Poor'].value_counts())
print(X_train['Cleanliness_Extremely Poor'].value_counts())
print(X_train['Onboard_Service_Extremely Poor'].value_counts())
print(X_train['Online_Support_Extremely Poor'].value_counts())
print(X_train['Platform_Location_Very Inconvenient'].value_counts())

Observations:

  • In the train dataset, there are very few observations of the following ratings: 'CheckIn_Service_Extremely Poor', 'Cleanliness_Extremely Poor', 'Onboard_Service_Extremely Poor', 'Online_Support_Extremely Poor', 'Platform_Location_Very Inconvenient'. Therefore, we will drop these columns so that the train and test sets have matching features.
In [111]:
# Defining the features to be dropped
columns_to_drop = ['ID',
                   'Platform_Location_Very Inconvenient',
                   'Online_Support_Extremely Poor',
                   'Onboard_Service_Extremely Poor',
                   'CheckIn_Service_Extremely Poor',
                   'Cleanliness_Extremely Poor']
In [112]:
# Dropping the features in the train set
X_train.drop(columns=columns_to_drop, inplace=True)
X_train.head()
Out[112]:
Age Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Seat_Comfort_Excellent Seat_Comfort_Extremely Poor Seat_Comfort_Good Seat_Comfort_Needs Improvement Seat_Comfort_Poor Seat_Class_Ordinary ... Cleanliness_Poor Online_Boarding_Excellent Online_Boarding_Extremely Poor Online_Boarding_Good Online_Boarding_Needs Improvement Online_Boarding_Poor Gender_Male Customer_Type_Loyal Customer Type_Travel_Personal Travel Travel_Class_Eco
0 52.0 272 0.0 5.0 False False False True False False ... False False False False False True False True False False
1 48.0 2200 9.0 0.0 False False False False True True ... False False False True False False True True True True
2 43.0 1061 77.0 119.0 False False False True False False ... False True False False False False False True False False
3 44.0 780 13.0 18.0 False False False False False True ... False False False False False False False True False False
4 50.0 1981 0.0 0.0 False False False False False True ... False False False True False False False True False False

5 rows × 73 columns

In [113]:
# Dropping the features in the test set
X_test.drop(columns='ID', inplace=True)
X_test.head()
Out[113]:
Age Travel_Distance Departure_Delay_in_Mins Arrival_Delay_in_Mins Seat_Comfort_Excellent Seat_Comfort_Extremely Poor Seat_Comfort_Good Seat_Comfort_Needs Improvement Seat_Comfort_Poor Seat_Class_Ordinary ... Cleanliness_Poor Online_Boarding_Excellent Online_Boarding_Extremely Poor Online_Boarding_Good Online_Boarding_Needs Improvement Online_Boarding_Poor Gender_Male Customer_Type_Loyal Customer Type_Travel_Personal Travel Travel_Class_Eco
0 36.0 532 0.0 0.0 False False False False False False ... False False False False False True False True False False
1 21.0 1425 9.0 28.0 False True False False False True ... False False False False False False False False False False
2 60.0 2832 0.0 0.0 True False False False False True ... False True False False False False True True False False
3 29.0 1352 0.0 0.0 False False False False False False ... False False False False False True False True True True
4 18.0 1610 17.0 0.0 True False False False False True ... False True False False False False True False False False

5 rows × 73 columns

Model evaluation criterion¶

The model can make two types of wrong predictions:

  1. Predicting that a customer is not satisfied (overall experience 0) when the customer actually rates it 1.
  2. Predicting that a customer is satisfied (overall experience 1) when the customer actually rates it 0.

Which case is more important?

  • The goal of our classification problem is to predict whether a customer will rate the overall experience 0 or 1. Customer dissatisfaction represents a failure to deliver a good overall experience, so the features associated with satisfaction and dissatisfaction uncovered through modelling can serve as evidence when exploring how to improve the overall customer experience.
  • In other words, we seek the highest prediction accuracy, so that the model can be applied on an ongoing basis to new customer survey and travel data.
In [115]:
# Creating the metric function

def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8,5))
    sns.heatmap(cm, annot=True, fmt='.2f', xticklabels=['Not Satisfied', 'Satisfied'], yticklabels=['Not Satisfied', 'Satisfied'])  # label order follows the class order [0, 1]
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

We will be building the following models:

  1. Logistic Regression
  2. Support Vector Machine
  3. Decision Tree
  4. Random Forest
  5. AdaBoost
  6. XGBoost

The best performing model will be recommended for deployment together with the list of important features.

1) Logistic Regression¶

In [118]:
# Fitting logistic regression model
lg = LogisticRegression()
lg.fit(X_train,Y_train)
Out[118]:
LogisticRegression()
In [119]:
# Checking the performance on the training data
Y_pred_train = lg.predict(X_train)
metrics_score(Y_train, Y_pred_train)
              precision    recall  f1-score   support

           0       0.85      0.84      0.85     42786
           1       0.87      0.88      0.87     51593

    accuracy                           0.86     94379
   macro avg       0.86      0.86      0.86     94379
weighted avg       0.86      0.86      0.86     94379

[Confusion matrix heatmap: logistic regression, train set]

Observations:

  • The logistic regression model achieved an accuracy of 86% on the train dataset, well below the benchmark of 95%.

  • In classification, the class of interest is considered the positive class. In this problem, the class of interest is 1 i.e., the customers who are satisfied and are likely to rate overall experience 1.

  • Reading the confusion matrix (row by row):

    • True Negative (Actual=0, Predicted=0): Model predicts that a customer rated overall experience 0 and the customer actually rated it 0.
    • False Positive (Actual=0, Predicted=1): Model predicts that a customer rated overall experience 1 and the customer actually rated it 0.
    • False Negative (Actual=1, Predicted=0): Model predicts that a customer rated overall experience 0 and the customer actually rated it 1.
    • True Positive (Actual=1, Predicted=1): Model predicts that a customer rated overall experience 1 and the customer actually rated it 1.
  • The model identifies most of the satisfied customers (recall of 0.88 for class 1), but its overall performance still falls short of the benchmark.
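The four cells described above can be unpacked programmatically. A minimal sketch with toy labels (hypothetical values, not the actual survey data) shows how scikit-learn's confusion_matrix orders them for binary labels [0, 1]:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (not the actual survey data)
actual    = np.array([0, 0, 1, 1, 1])
predicted = np.array([0, 1, 1, 1, 0])

# For binary labels [0, 1], ravel() returns (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(tn, fp, fn, tp)  # → 1 1 1 2
```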

In [121]:
# Predicting on the test dataset
Y_pred_test = lg.predict(X_test)
# metrics_score(Y_test, Y_pred_test)

The provided data contains no labels for the test set, so we cannot evaluate the model's performance on it before using it to generate predictions.

In [123]:
Y_pred_test
Out[123]:
array([1, 0, 1, ..., 0, 1, 0])

Let's check the coefficients and find which variables are leading to satisfaction:

In [125]:
# Printing the coefficients of logistic regression
cols=X_train.columns

coef_lg=lg.coef_

pd.DataFrame(coef_lg,columns=cols).T.sort_values(by=0,ascending=False)
Out[125]:
0
Onboard_Entertainment_Excellent 2.073871
Seat_Comfort_Excellent 1.505244
Customer_Type_Loyal Customer 1.128552
Onboard_Entertainment_Good 0.965846
Seat_Comfort_Extremely Poor 0.770817
... ...
Seat_Comfort_Needs Improvement -0.628659
Onboard_Entertainment_Poor -0.649767
Onboard_Entertainment_Needs Improvement -0.910343
Gender_Male -1.290781
Travel_Class_Eco -1.630595

73 rows × 1 columns

Observations:¶

According to the logistic regression model, the features that have the largest positive effect on overall experience are:

  • Onboard_Entertainment_Excellent, is the most important feature in determining customer satisfaction.
  • Seat_Comfort_Excellent, is the second most important feature.
  • Customer_Type_Loyal Customer, is the third most significant feature.
  • Onboard_Entertainment_Good, is the fourth most important feature.
  • Seat_Comfort_Extremely Poor, is the fifth most significant feature.

The features with the largest negative effect on overall experience are:

  • Onboard_Entertainment_Poor
  • Seat_Comfort_Needs Improvement
  • Onboard_Entertainment_Needs Improvement
  • Gender_Male
  • Travel_Class_Eco
In [127]:
# Finding the odds
odds = np.exp(lg.coef_[0])

# Adding the odds to a dataframe and sorting the values
pd.DataFrame(odds, X_train.columns, columns=['odds']).sort_values(by='odds', ascending=False)
Out[127]:
odds
Onboard_Entertainment_Excellent 7.955561
Seat_Comfort_Excellent 4.505252
Customer_Type_Loyal Customer 3.091177
Onboard_Entertainment_Good 2.627009
Seat_Comfort_Extremely Poor 2.161531
... ...
Seat_Comfort_Needs Improvement 0.533306
Onboard_Entertainment_Poor 0.522168
Onboard_Entertainment_Needs Improvement 0.402386
Gender_Male 0.275056
Travel_Class_Eco 0.195813

73 rows × 1 columns

Observations:¶

Having converted the log odds into real odds, we can interpret the results as follows:

  • The odds of satisfaction for a customer who rated Onboard Entertainment as Excellent are ~8 times the odds of a customer who did not.
  • The odds for a customer who rated Seat Comfort as Excellent are ~4.5 times the odds of a customer who did not.
  • The odds for a customer categorised as a Loyal Customer are ~3 times the odds of a customer who was not.
  • The odds for a customer who rated Onboard Entertainment as Good are ~2.6 times the odds of a customer who did not.
  • The odds for a customer who rated Seat Comfort as Extremely Poor are ~2.2 times the odds of a customer who did not.

Correspondingly, the features with the lowest odds of satisfaction (largest negative effect on overall experience) are:

  • Onboard_Entertainment_Poor
  • Seat_Comfort_Needs Improvement
  • Onboard_Entertainment_Needs Improvement
  • Gender_Male
  • Travel_Class_Eco
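As a quick sanity check on the table above, converting a single log-odds coefficient by hand reproduces the tabulated odds:

```python
import numpy as np

# Top coefficient from the logistic regression output above
coef = 2.073871               # Onboard_Entertainment_Excellent
odds = np.exp(coef)           # log-odds -> odds ratio
print(round(float(odds), 2))  # → 7.96, matching the odds table
```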

Precision-Recall curve¶

Next we will find the optimal threshold for the model using the Precision-Recall curve. The curve summarizes the trade-off between recall (the true positive rate) and precision (the positive predictive value) across different probability thresholds, so we can use it to search for a better threshold.

In [130]:
# Predict_proba gives the probability of each observation belonging to each class

y_scores_lg=lg.predict_proba(X_train)

precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(Y_train, y_scores_lg[:,1])

# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label='precision')
plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
[Precision-recall vs. threshold plot: logistic regression]

Observation:¶

  • We can see that precision and recall are balanced at a threshold of ~0.52.
In [132]:
# Calculating the exact threshold where precision and recall are equal.

for i in np.arange(len(thresholds_lg)):
    if precisions_lg[i]==recalls_lg[i]:
        print(thresholds_lg[i])
0.5170822910653909

Observation:¶

  • We can see that precision and recall are balanced at a threshold of 0.5170822910653909.
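A caveat on the loop above: it relies on precisions_lg[i] == recalls_lg[i] holding exactly, which floating-point arithmetic may never satisfy on other data. A more robust sketch (using toy labels and scores, since the notebook's arrays are not reproduced here) minimizes the absolute gap instead:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores for illustration only
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5, 0.6, 0.95])

prec, rec, thr = precision_recall_curve(y_true, y_score)

# precision_recall_curve returns one more precision/recall value than
# thresholds, so drop the last entry before comparing
idx = int(np.argmin(np.abs(prec[:-1] - rec[:-1])))
optimal_threshold = thr[idx]
```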

Let's find out the performance of the model at this threshold.

In [135]:
# Checking the performance of the model at the threshold
optimal_threshold=0.5170822910653909
Y_pred_train = lg.predict_proba(X_train)
metrics_score(Y_train, Y_pred_train[:,1]>optimal_threshold)
              precision    recall  f1-score   support

           0       0.85      0.85      0.85     42786
           1       0.87      0.87      0.87     51593

    accuracy                           0.86     94379
   macro avg       0.86      0.86      0.86     94379
weighted avg       0.86      0.86      0.86     94379

[Confusion matrix heatmap: logistic regression at the optimal threshold]

Observations:

  • After adjusting the logistic regression model to use the optimal threshold of 0.5170822910653909, the accuracy of the model on the train dataset was unchanged at 0.86.

Let's predict on the test data.

In [138]:
optimal_threshold1=0.5170822910653909
Y_pred_test = lg.predict_proba(X_test)
In [228]:
Y_pred_test
Out[228]:
array([[0.00468911, 0.99531089],
       [0.5404763 , 0.4595237 ],
       [0.02241624, 0.97758376],
       ...,
       [0.90386436, 0.09613564],
       [0.00496998, 0.99503002],
       [0.96424242, 0.03575758]])

2) Support Vector Machines¶

In [230]:
# Scaling the features to [-1, 1] to speed up SVM training
scaling = MinMaxScaler(feature_range=(-1,1)).fit(X_train)
X_train_scaled = scaling.transform(X_train)
X_test_scaled = scaling.transform(X_test)
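Note that the scaler is fitted on the training set only and then reused on the test set, which avoids leaking test-set statistics into training. A toy sketch (hypothetical values) of what MinMaxScaler with feature_range=(-1, 1) does to a single column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature column; fit on train only, reuse on test
train_col = np.array([[0.0], [5.0], [10.0]])
test_col  = np.array([[2.5], [12.0]])

scaler = MinMaxScaler(feature_range=(-1, 1)).fit(train_col)
scaled_train = scaler.transform(train_col)
scaled_test  = scaler.transform(test_col)

print(scaled_train.ravel())  # → [-1.  0.  1.]
print(scaled_test.ravel())   # → [-0.5  1.4]  (test values can fall outside the range)
```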

Let's build the models using two of the most widely used kernel functions:

  • Linear Kernel
  • RBF Kernel
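For reference, the RBF kernel used by SVC is k(x, y) = exp(-gamma * ||x - y||^2). A minimal sketch with toy vectors confirms the manual formula against scikit-learn's pairwise helper:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Toy vectors for illustration only
x = np.array([[1.0, 2.0]])
y = np.array([[2.0, 0.0]])
gamma = 0.5

manual  = np.exp(-gamma * np.sum((x - y) ** 2))  # exp(-gamma * ||x - y||^2)
library = rbf_kernel(x, y, gamma=gamma)[0, 0]
```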

2a) Linear Kernel SVM¶

In [233]:
# Fitting SVM
svm = SVC(kernel = 'linear') # Linear kernel or linear decision boundary
model = svm.fit(X = X_train_scaled, y = Y_train)
In [236]:
# Predicting on the train data 
y_pred_train_svm = model.predict(X_train_scaled)

# Checking performance on the train data
metrics_score(Y_train, y_pred_train_svm)
              precision    recall  f1-score   support

           0       0.89      0.90      0.89     42786
           1       0.91      0.91      0.91     51593

    accuracy                           0.90     94379
   macro avg       0.90      0.90      0.90     94379
weighted avg       0.90      0.90      0.90     94379

[Confusion matrix heatmap: linear-kernel SVM, train set]
In [240]:
# Predicting on the test data
Y_pred_test_svm = model.predict(X_test_scaled)
# Checking performance on the test data
# metrics_score(Y_test, Y_pred_test_svm)
In [241]:
Y_pred_test_svm
Out[241]:
array([1, 1, 1, ..., 0, 1, 0])
In [242]:
# Exporting the prediction on the test dataset as .csv

Prediction_1 = pd.DataFrame(Y_pred_test_svm)

Submission_1 = pd.concat([df_test['ID'],Prediction_1], axis=1)

Submission_1.columns=['ID', 'Overall_Experience']

# Saving the DataFrame as a CSV file 
# Submission_1.to_csv('Hackathon_Submission_1.csv', index = False)

# Checking the dataframe
Submission_1.head()
Out[242]:
ID Overall_Experience
0 99900001 1
1 99900002 1
2 99900003 1
3 99900004 0
4 99900005 1

Observations:

  • With an accuracy of 90%, the SVM model with linear kernel is underperforming relative to the benchmark we are targeting.

2b) RBF Kernel¶

In [250]:
svm_rbf=SVC(kernel='rbf',probability=True)
svm_rbf.fit(X_train_scaled,Y_train)
y_scores_svm=svm_rbf.predict_proba(X_train_scaled) # Predict_proba gives the probability of each observation belonging to each class


precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(Y_train, y_scores_svm[:,1])

# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
[Precision-recall vs. threshold plot: RBF-kernel SVM]
In [254]:
# Calculating the exact threshold where precision and recall are equal.
for i in np.arange(len(thresholds_svm)):
    if precisions_svm[i]==recalls_svm[i]:
        print(thresholds_svm[i])
0.4009583166478404
In [257]:
optimal_threshold1=0.4009583166478404
Y_pred_train = svm_rbf.predict_proba(X_train_scaled)

metrics_score(Y_train, Y_pred_train[:,1]>optimal_threshold1)
              precision    recall  f1-score   support

           0       0.96      0.96      0.96     42786
           1       0.96      0.96      0.96     51593

    accuracy                           0.96     94379
   macro avg       0.96      0.96      0.96     94379
weighted avg       0.96      0.96      0.96     94379

[Confusion matrix heatmap: RBF-kernel SVM at the optimal threshold]
In [259]:
Y_pred_test = svm_rbf.predict_proba(X_test_scaled)

# metrics_score(Y_test, Y_pred_test[:,1]>optimal_threshold1)
In [261]:
Y_pred_test
Out[261]:
array([[4.06454001e-03, 9.95935460e-01],
       [1.13842294e-02, 9.88615771e-01],
       [7.69186587e-07, 9.99999231e-01],
       ...,
       [5.80178477e-01, 4.19821523e-01],
       [4.25033877e-03, 9.95749661e-01],
       [9.19224339e-01, 8.07756608e-02]])
In [263]:
# Selecting the probability of the desired class
class_1_pred = Y_pred_test[:, 1]
In [265]:
class_1_pred
Out[265]:
array([0.99593546, 0.98861577, 0.99999923, ..., 0.41982152, 0.99574966,
       0.08077566])
In [267]:
# Exporting the prediction on the test dataset as .csv

Prediction_2 = pd.DataFrame(class_1_pred)

Submission_2 = pd.concat([df_test['ID'],Prediction_2], axis=1)

Submission_2.columns=['ID', 'Overall_Experience']

# Rounding the predicted probabilities to the nearest integer label (0 or 1)
Submission_2['Overall_Experience'] = Submission_2['Overall_Experience'].round().astype(int)

# Saving the DataFrame as a CSV file 
# Submission_2.to_csv('Hackathon_Submission_2.csv', index = False)

# Checking the dataframe
Submission_2.head()
Out[267]:
ID Overall_Experience
0 99900001 1
1 99900002 1
2 99900003 1
3 99900004 0
4 99900005 1

Observations:

  • At the optimal threshold of 0.4009583166478404, the SVM with RBF kernel achieved a higher train-set accuracy (96%) than the linear kernel (90%).
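One caveat on the rounding step above: .round() implicitly applies a 0.5 cutoff rather than the tuned threshold of ~0.40. To keep the exported labels consistent with the threshold analysis, the class-1 probabilities could be compared against the threshold directly (a sketch with hypothetical probabilities):

```python
import numpy as np

# Hypothetical class-1 probabilities, as returned by predict_proba[:, 1]
proba_class_1 = np.array([0.99, 0.45, 0.41, 0.08])
optimal_threshold1 = 0.4009583166478404

# Thresholding at the tuned value instead of the implicit 0.5 of round()
labels = (proba_class_1 > optimal_threshold1).astype(int)
print(labels)                   # → [1 1 1 0]
print(np.round(proba_class_1))  # → [1. 0. 0. 0.]  (the implicit 0.5 cutoff disagrees)
```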

3) Decision Tree¶

In [270]:
# Building decision tree model
model_dt= DecisionTreeClassifier(random_state=1,max_depth=8)
model_dt.fit(X_train, Y_train)
Out[270]:
DecisionTreeClassifier(max_depth=8, random_state=1)

Let's check the model performance of decision tree

In [272]:
# Checking performance on the training dataset

pred_train_dt = model_dt.predict(X_train)

metrics_score(Y_train, pred_train_dt)
              precision    recall  f1-score   support

           0       0.88      0.91      0.90     42786
           1       0.92      0.90      0.91     51593

    accuracy                           0.90     94379
   macro avg       0.90      0.90      0.90     94379
weighted avg       0.90      0.90      0.90     94379

[Confusion matrix heatmap: decision tree, train set]

Observation:

  • The baseline Decision Tree model has an accuracy of 90% on the training set, below our target of 95%.
  • The training performance does not suggest overfitting, however, so we may obtain better accuracy by tuning the model's hyperparameters.
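One way to tune, sketched here with GridSearchCV on synthetic data (the make_classification stand-in and the grid values are illustrative, not the notebook's actual search):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for X_train/Y_train; grid values are illustrative
X, y = make_classification(n_samples=500, random_state=1)

param_grid = {'max_depth': [6, 8, 10], 'min_samples_leaf': [1, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    param_grid, scoring='accuracy', cv=5)
grid.fit(X, y)

best_dt = grid.best_estimator_  # refit on the full data with the best params
```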

Predicting on the test data and checking performance

In [274]:
pred_test_dt = model_dt.predict(X_test)
# metrics_score(y_test, pred_test_dt)
In [276]:
pred_test_dt 
Out[276]:
array([1, 1, 1, ..., 0, 1, 0])
In [278]:
# Exporting the prediction on the test dataset as .csv

Prediction_3 = pd.DataFrame(pred_test_dt)

Submission_3 = pd.concat([df_test['ID'],Prediction_3], axis=1)

Submission_3.columns=['ID', 'Overall_Experience']

# Saving the DataFrame as a CSV file 
#Submission_3.to_csv('Hackathon_Submission_3.csv', index = False)

# Checking the dataframe
Submission_3.head()
Out[278]:
ID Overall_Experience
0 99900001 1
1 99900002 1
2 99900003 1
3 99900004 0
4 99900005 1

Let's visualize the decision tree and observe the decision rules:

In [280]:
features = list(X_train.columns)

plt.figure(figsize=(20,20))
from sklearn import tree
tree.plot_tree(model_dt,feature_names=features,max_depth =4, filled=True,fontsize=9,node_ids=True)
plt.show()
[Decision tree visualization (top 4 levels)]

Observations:

The root node of the Decision Tree is Onboard_Entertainment_Excellent <= 0.50; this split yields the highest information gain. Node #1, Onboard_Entertainment_Good <= 0.50, is the second most influential split. By following the appropriate branches from the internal nodes down to the leaf nodes, we can trace the model's decision-making to the tree's final classification.

In [282]:
# Checking the weights of the decision tree
print(tree.export_text(model_dt, feature_names=X_train.columns.tolist(), show_weights=True))
|--- Onboard_Entertainment_Excellent <= 0.50
|   |--- Onboard_Entertainment_Good <= 0.50
|   |   |--- Seat_Comfort_Extremely Poor <= 0.50
|   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |--- Seat_Comfort_Good <= 0.50
|   |   |   |   |   |--- Travel_Class_Eco <= 0.50
|   |   |   |   |   |   |--- Ease_of_Online_Booking_Excellent <= 0.50
|   |   |   |   |   |   |   |--- Ease_of_Online_Booking_Good <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [7344.00, 625.00] class: 0
|   |   |   |   |   |   |   |--- Ease_of_Online_Booking_Good >  0.50
|   |   |   |   |   |   |   |   |--- weights: [983.00, 811.00] class: 0
|   |   |   |   |   |   |--- Ease_of_Online_Booking_Excellent >  0.50
|   |   |   |   |   |   |   |--- Customer_Type_Loyal Customer <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [595.00, 20.00] class: 0
|   |   |   |   |   |   |   |--- Customer_Type_Loyal Customer >  0.50
|   |   |   |   |   |   |   |   |--- weights: [186.00, 806.00] class: 1
|   |   |   |   |   |--- Travel_Class_Eco >  0.50
|   |   |   |   |   |   |--- Travel_Distance <= 920.50
|   |   |   |   |   |   |   |--- Type_Travel_Personal Travel <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [697.00, 104.00] class: 0
|   |   |   |   |   |   |   |--- Type_Travel_Personal Travel >  0.50
|   |   |   |   |   |   |   |   |--- weights: [48.00, 167.00] class: 1
|   |   |   |   |   |   |--- Travel_Distance >  920.50
|   |   |   |   |   |   |   |--- Gender_Male <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [5970.00, 600.00] class: 0
|   |   |   |   |   |   |   |--- Gender_Male >  0.50
|   |   |   |   |   |   |   |   |--- weights: [15854.00, 306.00] class: 0
|   |   |   |   |--- Seat_Comfort_Good >  0.50
|   |   |   |   |   |--- Arrival_Time_Convenient_Good <= 0.50
|   |   |   |   |   |   |--- Travel_Class_Eco <= 0.50
|   |   |   |   |   |   |   |--- Customer_Type_Loyal Customer <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [25.00, 56.00] class: 1
|   |   |   |   |   |   |   |--- Customer_Type_Loyal Customer >  0.50
|   |   |   |   |   |   |   |   |--- weights: [463.00, 77.00] class: 0
|   |   |   |   |   |   |--- Travel_Class_Eco >  0.50
|   |   |   |   |   |   |   |--- Baggage_Handling_Good <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [346.00, 192.00] class: 0
|   |   |   |   |   |   |   |--- Baggage_Handling_Good >  0.50
|   |   |   |   |   |   |   |   |--- weights: [236.00, 427.00] class: 1
|   |   |   |   |   |--- Arrival_Time_Convenient_Good >  0.50
|   |   |   |   |   |   |--- Ease_of_Online_Booking_Poor <= 0.50
|   |   |   |   |   |   |   |--- Platform_Location_Manageable <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [299.00, 1146.00] class: 1
|   |   |   |   |   |   |   |--- Platform_Location_Manageable >  0.50
|   |   |   |   |   |   |   |   |--- weights: [69.00, 49.00] class: 0
|   |   |   |   |   |   |--- Ease_of_Online_Booking_Poor >  0.50
|   |   |   |   |   |   |   |--- Age <= 35.50
|   |   |   |   |   |   |   |   |--- weights: [14.00, 11.00] class: 0
|   |   |   |   |   |   |   |--- Age >  35.50
|   |   |   |   |   |   |   |   |--- weights: [43.00, 10.00] class: 0
|   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |--- Legroom_Good <= 0.50
|   |   |   |   |   |--- Onboard_Service_Poor <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1283.00] class: 1
|   |   |   |   |   |--- Onboard_Service_Poor >  0.50
|   |   |   |   |   |   |--- Catering_Good <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 47.00] class: 1
|   |   |   |   |   |   |--- Catering_Good >  0.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Legroom_Good >  0.50
|   |   |   |   |   |--- Ease_of_Online_Booking_Excellent <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 402.00] class: 1
|   |   |   |   |   |--- Ease_of_Online_Booking_Excellent >  0.50
|   |   |   |   |   |   |--- Catering_Excellent <= 0.50
|   |   |   |   |   |   |   |--- Travel_Class_Eco <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [24.00, 7.00] class: 0
|   |   |   |   |   |   |   |--- Travel_Class_Eco >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 9.00] class: 1
|   |   |   |   |   |   |--- Catering_Excellent >  0.50
|   |   |   |   |   |   |   |--- Online_Boarding_Good <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 24.00] class: 1
|   |   |   |   |   |   |   |--- Online_Boarding_Good >  0.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 4.00] class: 1
|   |   |--- Seat_Comfort_Extremely Poor >  0.50
|   |   |   |--- Online_Boarding_Extremely Poor <= 0.50
|   |   |   |   |--- weights: [0.00, 1880.00] class: 1
|   |   |   |--- Online_Boarding_Extremely Poor >  0.50
|   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |--- Onboard_Entertainment_Good >  0.50
|   |   |--- Catering_Good <= 0.50
|   |   |   |--- Ease_of_Online_Booking_Needs Improvement <= 0.50
|   |   |   |   |--- Ease_of_Online_Booking_Poor <= 0.50
|   |   |   |   |   |--- Seat_Comfort_Good <= 0.50
|   |   |   |   |   |   |--- Online_Boarding_Poor <= 0.50
|   |   |   |   |   |   |   |--- Online_Boarding_Needs Improvement <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [726.00, 12009.00] class: 1
|   |   |   |   |   |   |   |--- Online_Boarding_Needs Improvement >  0.50
|   |   |   |   |   |   |   |   |--- weights: [185.00, 237.00] class: 1
|   |   |   |   |   |   |--- Online_Boarding_Poor >  0.50
|   |   |   |   |   |   |   |--- Legroom_Excellent <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [199.00, 102.00] class: 0
|   |   |   |   |   |   |   |--- Legroom_Excellent >  0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 115.00] class: 1
|   |   |   |   |   |--- Seat_Comfort_Good >  0.50
|   |   |   |   |   |   |--- CheckIn_Service_Excellent <= 0.50
|   |   |   |   |   |   |   |--- Cleanliness_Excellent <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [1041.00, 1450.00] class: 1
|   |   |   |   |   |   |   |--- Cleanliness_Excellent >  0.50
|   |   |   |   |   |   |   |   |--- weights: [48.00, 472.00] class: 1
|   |   |   |   |   |   |--- CheckIn_Service_Excellent >  0.50
|   |   |   |   |   |   |   |--- Customer_Type_Loyal Customer <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [16.00, 23.00] class: 1
|   |   |   |   |   |   |   |--- Customer_Type_Loyal Customer >  0.50
|   |   |   |   |   |   |   |   |--- weights: [34.00, 571.00] class: 1
|   |   |   |   |--- Ease_of_Online_Booking_Poor >  0.50
|   |   |   |   |   |--- Seat_Comfort_Extremely Poor <= 0.50
|   |   |   |   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |   |   |   |--- Legroom_Poor <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [218.00, 107.00] class: 0
|   |   |   |   |   |   |   |--- Legroom_Poor >  0.50
|   |   |   |   |   |   |   |   |--- weights: [338.00, 6.00] class: 0
|   |   |   |   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |   |   |   |   |--- Seat_Comfort_Extremely Poor >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 140.00] class: 1
|   |   |   |--- Ease_of_Online_Booking_Needs Improvement >  0.50
|   |   |   |   |--- Seat_Comfort_Needs Improvement <= 0.50
|   |   |   |   |   |--- Baggage_Handling_Needs Improvement <= 0.50
|   |   |   |   |   |   |--- Seat_Comfort_Extremely Poor <= 0.50
|   |   |   |   |   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [179.00, 82.00] class: 0
|   |   |   |   |   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 20.00] class: 1
|   |   |   |   |   |   |--- Seat_Comfort_Extremely Poor >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 30.00] class: 1
|   |   |   |   |   |--- Baggage_Handling_Needs Improvement >  0.50
|   |   |   |   |   |   |--- Legroom_Needs Improvement <= 0.50
|   |   |   |   |   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [85.00, 55.00] class: 0
|   |   |   |   |   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 38.00] class: 1
|   |   |   |   |   |   |--- Legroom_Needs Improvement >  0.50
|   |   |   |   |   |   |   |--- Cleanliness_Poor <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [6.00, 434.00] class: 1
|   |   |   |   |   |   |   |--- Cleanliness_Poor >  0.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- Seat_Comfort_Needs Improvement >  0.50
|   |   |   |   |   |--- Online_Support_Excellent <= 0.50
|   |   |   |   |   |   |--- Online_Boarding_Excellent <= 0.50
|   |   |   |   |   |   |   |--- CheckIn_Service_Excellent <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [630.00, 48.00] class: 0
|   |   |   |   |   |   |   |--- CheckIn_Service_Excellent >  0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 19.00] class: 1
|   |   |   |   |   |   |--- Online_Boarding_Excellent >  0.50
|   |   |   |   |   |   |   |--- Arrival_Time_Convenient_Excellent <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 27.00] class: 1
|   |   |   |   |   |   |   |--- Arrival_Time_Convenient_Excellent >  0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 1.00] class: 0
|   |   |   |   |   |--- Online_Support_Excellent >  0.50
|   |   |   |   |   |   |--- Travel_Distance <= 204.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Travel_Distance >  204.50
|   |   |   |   |   |   |   |--- Arrival_Time_Convenient_Excellent <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 56.00] class: 1
|   |   |   |   |   |   |   |--- Arrival_Time_Convenient_Excellent >  0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 3.00] class: 1
|   |   |--- Catering_Good >  0.50
|   |   |   |--- Travel_Class_Eco <= 0.50
|   |   |   |   |--- Seat_Comfort_Good <= 0.50
|   |   |   |   |   |--- Arrival_Time_Convenient_Good <= 0.50
|   |   |   |   |   |   |--- Ease_of_Online_Booking_Poor <= 0.50
|   |   |   |   |   |   |   |--- Ease_of_Online_Booking_Good <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [51.00, 143.00] class: 1
|   |   |   |   |   |   |   |--- Ease_of_Online_Booking_Good >  0.50
|   |   |   |   |   |   |   |   |--- weights: [6.00, 171.00] class: 1
|   |   |   |   |   |   |--- Ease_of_Online_Booking_Poor >  0.50
|   |   |   |   |   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Arrival_Time_Convenient_Good >  0.50
|   |   |   |   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |   |   |   |--- Online_Support_Excellent <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [305.00, 14.00] class: 0
|   |   |   |   |   |   |   |--- Online_Support_Excellent >  0.50
|   |   |   |   |   |   |   |   |--- weights: [6.00, 12.00] class: 1
|   |   |   |   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |   |   |   |--- Travel_Distance <= 3736.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 41.00] class: 1
|   |   |   |   |   |   |   |--- Travel_Distance >  3736.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- Seat_Comfort_Good >  0.50
|   |   |   |   |   |--- Customer_Type_Loyal Customer <= 0.50
|   |   |   |   |   |   |--- Age <= 24.50
|   |   |   |   |   |   |   |--- Onboard_Service_Poor <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [17.00, 286.00] class: 1
|   |   |   |   |   |   |   |--- Onboard_Service_Poor >  0.50
|   |   |   |   |   |   |   |   |--- weights: [6.00, 4.00] class: 0
|   |   |   |   |   |   |--- Age >  24.50
|   |   |   |   |   |   |   |--- Age <= 30.50
|   |   |   |   |   |   |   |   |--- weights: [116.00, 223.00] class: 1
|   |   |   |   |   |   |   |--- Age >  30.50
|   |   |   |   |   |   |   |   |--- weights: [251.00, 235.00] class: 0
|   |   |   |   |   |--- Customer_Type_Loyal Customer >  0.50
|   |   |   |   |   |   |--- Platform_Location_Manageable <= 0.50
|   |   |   |   |   |   |   |--- Type_Travel_Personal Travel <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [163.00, 2026.00] class: 1
|   |   |   |   |   |   |   |--- Type_Travel_Personal Travel >  0.50
|   |   |   |   |   |   |   |   |--- weights: [56.00, 87.00] class: 1
|   |   |   |   |   |   |--- Platform_Location_Manageable >  0.50
|   |   |   |   |   |   |   |--- Arrival_Delay_in_Mins <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [27.00, 23.00] class: 0
|   |   |   |   |   |   |   |--- Arrival_Delay_in_Mins >  0.50
|   |   |   |   |   |   |   |   |--- weights: [30.00, 9.00] class: 0
|   |   |   |--- Travel_Class_Eco >  0.50
|   |   |   |   |--- Ease_of_Online_Booking_Good <= 0.50
|   |   |   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |   |   |--- Seat_Comfort_Good <= 0.50
|   |   |   |   |   |   |   |--- Gender_Male <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [209.00, 48.00] class: 0
|   |   |   |   |   |   |   |--- Gender_Male >  0.50
|   |   |   |   |   |   |   |   |--- weights: [284.00, 6.00] class: 0
|   |   |   |   |   |   |--- Seat_Comfort_Good >  0.50
|   |   |   |   |   |   |   |--- Gender_Male <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [1117.00, 844.00] class: 0
|   |   |   |   |   |   |   |--- Gender_Male >  0.50
|   |   |   |   |   |   |   |   |--- weights: [1336.00, 502.00] class: 0
|   |   |   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |   |   |--- Travel_Distance <= 2890.50
|   |   |   |   |   |   |   |--- Cleanliness_Good <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 88.00] class: 1
|   |   |   |   |   |   |   |--- Cleanliness_Good >  0.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 25.00] class: 1
|   |   |   |   |   |   |--- Travel_Distance >  2890.50
|   |   |   |   |   |   |   |--- Arrival_Delay_in_Mins <= 28.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Arrival_Delay_in_Mins >  28.00
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Ease_of_Online_Booking_Good >  0.50
|   |   |   |   |   |--- Platform_Location_Manageable <= 0.50
|   |   |   |   |   |   |--- Arrival_Time_Convenient_Good <= 0.50
|   |   |   |   |   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [274.00, 210.00] class: 0
|   |   |   |   |   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 20.00] class: 1
|   |   |   |   |   |   |--- Arrival_Time_Convenient_Good >  0.50
|   |   |   |   |   |   |   |--- Customer_Type_Loyal Customer <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [63.00, 41.00] class: 0
|   |   |   |   |   |   |   |--- Customer_Type_Loyal Customer >  0.50
|   |   |   |   |   |   |   |   |--- weights: [238.00, 637.00] class: 1
|   |   |   |   |   |--- Platform_Location_Manageable >  0.50
|   |   |   |   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |   |   |   |--- Age <= 16.50
|   |   |   |   |   |   |   |   |--- weights: [47.00, 4.00] class: 0
|   |   |   |   |   |   |   |--- Age >  16.50
|   |   |   |   |   |   |   |   |--- weights: [223.00, 103.00] class: 0
|   |   |   |   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|--- Onboard_Entertainment_Excellent >  0.50
|   |--- Type_Travel_Personal Travel <= 0.50
|   |   |--- Customer_Type_Loyal Customer <= 0.50
|   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |--- Travel_Class_Eco <= 0.50
|   |   |   |   |   |--- Age <= 31.00
|   |   |   |   |   |   |--- Travel_Distance <= 3174.00
|   |   |   |   |   |   |   |--- Departure_Delay_in_Mins <= 165.00
|   |   |   |   |   |   |   |   |--- weights: [7.00, 60.00] class: 1
|   |   |   |   |   |   |   |--- Departure_Delay_in_Mins >  165.00
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- Travel_Distance >  3174.00
|   |   |   |   |   |   |   |--- Platform_Location_Manageable <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Platform_Location_Manageable >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Age >  31.00
|   |   |   |   |   |   |--- Online_Support_Good <= 0.50
|   |   |   |   |   |   |   |--- Legroom_Needs Improvement <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [37.00, 7.00] class: 0
|   |   |   |   |   |   |   |--- Legroom_Needs Improvement >  0.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 4.00] class: 1
|   |   |   |   |   |   |--- Online_Support_Good >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Travel_Class_Eco >  0.50
|   |   |   |   |   |--- Cleanliness_Excellent <= 0.50
|   |   |   |   |   |   |--- Baggage_Handling_Excellent <= 0.50
|   |   |   |   |   |   |   |--- Platform_Location_Inconvenient <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [110.00, 7.00] class: 0
|   |   |   |   |   |   |   |--- Platform_Location_Inconvenient >  0.50
|   |   |   |   |   |   |   |   |--- weights: [7.00, 5.00] class: 0
|   |   |   |   |   |   |--- Baggage_Handling_Excellent >  0.50
|   |   |   |   |   |   |   |--- Platform_Location_Manageable <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [5.00, 9.00] class: 1
|   |   |   |   |   |   |   |--- Platform_Location_Manageable >  0.50
|   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |--- Cleanliness_Excellent >  0.50
|   |   |   |   |   |   |--- Onboard_Wifi_Service_Good <= 0.50
|   |   |   |   |   |   |   |--- Legroom_Poor <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [8.00, 4.00] class: 0
|   |   |   |   |   |   |   |--- Legroom_Poor >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Onboard_Wifi_Service_Good >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |--- weights: [0.00, 1318.00] class: 1
|   |   |--- Customer_Type_Loyal Customer >  0.50
|   |   |   |--- Ease_of_Online_Booking_Poor <= 0.50
|   |   |   |   |--- Legroom_Extremely Poor <= 0.50
|   |   |   |   |   |--- Travel_Class_Eco <= 0.50
|   |   |   |   |   |   |--- Legroom_Poor <= 0.50
|   |   |   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |   |   |--- weights: [11.00, 11450.00] class: 1
|   |   |   |   |   |   |   |--- Age >  60.50
|   |   |   |   |   |   |   |   |--- weights: [3.00, 235.00] class: 1
|   |   |   |   |   |   |--- Legroom_Poor >  0.50
|   |   |   |   |   |   |   |--- Travel_Distance <= 915.50
|   |   |   |   |   |   |   |   |--- weights: [4.00, 8.00] class: 1
|   |   |   |   |   |   |   |--- Travel_Distance >  915.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 176.00] class: 1
|   |   |   |   |   |--- Travel_Class_Eco >  0.50
|   |   |   |   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |   |   |   |--- Departure_Delay_in_Mins <= 129.50
|   |   |   |   |   |   |   |   |--- weights: [45.00, 584.00] class: 1
|   |   |   |   |   |   |   |--- Departure_Delay_in_Mins >  129.50
|   |   |   |   |   |   |   |   |--- weights: [10.00, 3.00] class: 0
|   |   |   |   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1825.00] class: 1
|   |   |   |   |--- Legroom_Extremely Poor >  0.50
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- Ease_of_Online_Booking_Poor >  0.50
|   |   |   |   |--- Travel_Class_Eco <= 0.50
|   |   |   |   |   |--- Travel_Distance <= 272.00
|   |   |   |   |   |   |--- Seat_Comfort_Extremely Poor <= 0.50
|   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |--- Seat_Comfort_Extremely Poor >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |--- Travel_Distance >  272.00
|   |   |   |   |   |   |--- weights: [0.00, 135.00] class: 1
|   |   |   |   |--- Travel_Class_Eco >  0.50
|   |   |   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |   |   |--- Seat_Comfort_Extremely Poor <= 0.50
|   |   |   |   |   |   |   |--- Onboard_Wifi_Service_Good <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [15.00, 2.00] class: 0
|   |   |   |   |   |   |   |--- Onboard_Wifi_Service_Good >  0.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 6.00] class: 1
|   |   |   |   |   |   |--- Seat_Comfort_Extremely Poor >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |   |   |--- Onboard_Wifi_Service_Good <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 42.00] class: 1
|   |   |   |   |   |   |--- Onboard_Wifi_Service_Good >  0.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |--- Type_Travel_Personal Travel >  0.50
|   |   |--- Gender_Male <= 0.50
|   |   |   |--- Arrival_Delay_in_Mins <= 131.00
|   |   |   |   |--- Seat_Comfort_Good <= 0.50
|   |   |   |   |   |--- Arrival_Time_Convenient_Excellent <= 0.50
|   |   |   |   |   |   |--- Travel_Distance <= 5727.00
|   |   |   |   |   |   |   |--- Departure_Delay_in_Mins <= 128.50
|   |   |   |   |   |   |   |   |--- weights: [11.00, 2584.00] class: 1
|   |   |   |   |   |   |   |--- Departure_Delay_in_Mins >  128.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Travel_Distance >  5727.00
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Arrival_Time_Convenient_Excellent >  0.50
|   |   |   |   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |   |   |   |--- Seat_Comfort_Extremely Poor <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [72.00, 25.00] class: 0
|   |   |   |   |   |   |   |--- Seat_Comfort_Extremely Poor >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 48.00] class: 1
|   |   |   |   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 739.00] class: 1
|   |   |   |   |--- Seat_Comfort_Good >  0.50
|   |   |   |   |   |--- Arrival_Time_Convenient_Excellent <= 0.50
|   |   |   |   |   |   |--- Platform_Location_Manageable <= 0.50
|   |   |   |   |   |   |   |--- Platform_Location_Inconvenient <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [58.00, 652.00] class: 1
|   |   |   |   |   |   |   |--- Platform_Location_Inconvenient >  0.50
|   |   |   |   |   |   |   |   |--- weights: [29.00, 42.00] class: 1
|   |   |   |   |   |   |--- Platform_Location_Manageable >  0.50
|   |   |   |   |   |   |   |--- Departure_Delay_in_Mins <= 49.50
|   |   |   |   |   |   |   |   |--- weights: [36.00, 44.00] class: 1
|   |   |   |   |   |   |   |--- Departure_Delay_in_Mins >  49.50
|   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |--- Arrival_Time_Convenient_Excellent >  0.50
|   |   |   |   |   |   |--- Platform_Location_Very Convenient <= 0.50
|   |   |   |   |   |   |   |--- Travel_Distance <= 960.00
|   |   |   |   |   |   |   |   |--- weights: [48.00, 23.00] class: 0
|   |   |   |   |   |   |   |--- Travel_Distance >  960.00
|   |   |   |   |   |   |   |   |--- weights: [27.00, 35.00] class: 1
|   |   |   |   |   |   |--- Platform_Location_Very Convenient >  0.50
|   |   |   |   |   |   |   |--- Online_Support_Excellent <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 18.00] class: 1
|   |   |   |   |   |   |   |--- Online_Support_Excellent >  0.50
|   |   |   |   |   |   |   |   |--- weights: [6.00, 5.00] class: 0
|   |   |   |--- Arrival_Delay_in_Mins >  131.00
|   |   |   |   |--- weights: [57.00, 0.00] class: 0
|   |   |--- Gender_Male >  0.50
|   |   |   |--- Seat_Comfort_Excellent <= 0.50
|   |   |   |   |--- Seat_Comfort_Extremely Poor <= 0.50
|   |   |   |   |   |--- Seat_Comfort_Good <= 0.50
|   |   |   |   |   |   |--- weights: [284.00, 0.00] class: 0
|   |   |   |   |   |--- Seat_Comfort_Good >  0.50
|   |   |   |   |   |   |--- Onboard_Wifi_Service_Good <= 0.50
|   |   |   |   |   |   |   |--- Arrival_Delay_in_Mins <= 5.50
|   |   |   |   |   |   |   |   |--- weights: [37.00, 27.00] class: 0
|   |   |   |   |   |   |   |--- Arrival_Delay_in_Mins >  5.50
|   |   |   |   |   |   |   |   |--- weights: [25.00, 3.00] class: 0
|   |   |   |   |   |   |--- Onboard_Wifi_Service_Good >  0.50
|   |   |   |   |   |   |   |--- Travel_Distance <= 5287.00
|   |   |   |   |   |   |   |   |--- weights: [40.00, 3.00] class: 0
|   |   |   |   |   |   |   |--- Travel_Distance >  5287.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Seat_Comfort_Extremely Poor >  0.50
|   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |   |--- Seat_Comfort_Excellent >  0.50
|   |   |   |   |--- weights: [0.00, 462.00] class: 1

In [284]:
# Importance of features in the tree building

feature_names = list(X_train.columns)
importances = model_dt.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(10, 15))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observation:

The baseline model finds the following five features to be the most important in influencing overall experience:

  • Onboard_Entertainment_Excellent
  • Onboard_Entertainment_Good
  • Seat_Comfort_Excellent
  • Seat_Comfort_Extremely_Poor
  • Seat_Comfort_Good

Motivation for tuning the hyperparameters using GridSearchCV:

To see whether model performance can be improved, we will tune the hyperparameters using GridSearchCV. The search evaluates each candidate combination of hyperparameter values (e.g., tree depth, minimum samples per leaf) by cross-validation and keeps the best one, improving generalization (i.e., predictive power) and guarding against overfitting.

What about pruning the tree?

Since the Decision Tree is not overfitting the training dataset, pruning could increase misclassification: removing nodes or features introduces information loss, reducing the quality of the classifier.
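Should overfitting appear later, scikit-learn also supports post-pruning via cost-complexity pruning (the `ccp_alpha` parameter). A minimal sketch on synthetic stand-in data (the hackathon frames are not reproduced here; all names below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the hackathon data (illustration only)
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)

# Effective alphas along the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)

# Refit a tree for a subset of candidate alphas; keep the best on held-out data
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas[::5]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=1).fit(X_tr, y_tr)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score
```

Larger alphas prune more aggressively; validating each candidate guards against pruning away genuinely useful splits.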

In [287]:
# Choosing the type of classifier
dtree_estimator = DecisionTreeClassifier(class_weight='balanced', random_state=1)

# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 7), 
              'criterion': ['gini', 'entropy'],
              'min_samples_leaf': [5, 10, 20, 25]
             }

# Type of scoring used to compare parameter combinations
scorer = make_scorer(accuracy_score)

# Run the grid search
gridCV = GridSearchCV(dtree_estimator, parameters, scoring = scorer, cv = 5)

# Fitting the grid search on the train data
gridCV = gridCV.fit(X_train, Y_train)

# Set the classifier to the best combination of parameters
dtree_estimator = gridCV.best_estimator_

# Fit the best estimator to the data
dtree_estimator.fit(X_train, Y_train)
Out[287]:
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=6, min_samples_leaf=5, random_state=1)
In [289]:
# Checking performance on the training dataset
y_train_pred_dt = dtree_estimator.predict(X_train)

metrics_score(Y_train, y_train_pred_dt)
              precision    recall  f1-score   support

           0       0.82      0.93      0.88     42786
           1       0.94      0.84      0.88     51593

    accuracy                           0.88     94379
   macro avg       0.88      0.88      0.88     94379
weighted avg       0.89      0.88      0.88     94379

In [291]:
# Checking performance on the test dataset
y_test_pred_dt = dtree_estimator.predict(X_test)

# metrics_score(y_test, y_test_pred_dt)
In [293]:
y_test_pred_dt
Out[293]:
array([1, 1, 1, ..., 0, 1, 0])
In [295]:
# Exporting the prediction on the test dataset as .csv

Prediction_4 = pd.DataFrame(y_test_pred_dt)

Submission_4 = pd.concat([df_test['ID'], Prediction_4], axis=1)

Submission_4.columns=['ID', 'Overall_Experience']

# Saving the DataFrame as a CSV file 
# Submission_4.to_csv('Hackathon_Submission_4.csv', index = False)

# Checking the dataframe
Submission_4.head()
Out[295]:
ID Overall_Experience
0 99900001 1
1 99900002 1
2 99900003 1
3 99900004 0
4 99900005 1

Observations:

Compared to the model with the default hyperparameter values, tuning has actually reduced accuracy by 0.02. We will therefore try other model families in search of better performance.

4) Random Forest¶

In [297]:
# Fitting the Random Forest classifier on the training data
rf_estimator = RandomForestClassifier(random_state = 1)

rf_estimator.fit(X_train, Y_train)
Out[297]:
RandomForestClassifier(random_state=1)
In [299]:
# Checking performance on the training data
y_pred_train_rf = rf_estimator.predict(X_train)

metrics_score(Y_train, y_pred_train_rf)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     42786
           1       1.00      1.00      1.00     51593

    accuracy                           1.00     94379
   macro avg       1.00      1.00      1.00     94379
weighted avg       1.00      1.00      1.00     94379


Observation:

On the training dataset, the Random Forest scores 100% on every metric, which indicates that the model is overfitting the training data.
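One way to quantify this without a separate validation split is the forest's out-of-bag (OOB) score. A minimal sketch on synthetic stand-in data (names illustrative, not the hackathon frames):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the hackathon data (illustration only)
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)

# Each tree is scored on the samples its bootstrap draw left out, so the
# OOB score approximates accuracy on unseen data without a hold-out set.
rf = RandomForestClassifier(oob_score=True, random_state=1).fit(X, y)

train_acc = rf.score(X, y)  # near 1.0 for an unconstrained forest
oob_acc = rf.oob_score_     # a more honest estimate of generalization
```

A large gap between `train_acc` and `oob_acc` is the overfitting signature seen above.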

In [301]:
# Checking performance on the testing data
y_pred_test_rf = rf_estimator.predict(X_test)

# metrics_score(y_test, y_pred_test_rf)
In [303]:
y_pred_test_rf
Out[303]:
array([1, 1, 1, ..., 0, 1, 0])
In [305]:
# Exporting the prediction on the test dataset as .csv

Prediction_5 = pd.DataFrame(y_pred_test_rf)

Submission_5 = pd.concat([df_test['ID'], Prediction_5], axis=1)

Submission_5.columns=['ID', 'Overall_Experience']

# Saving the DataFrame as a CSV file 
# Submission_5.to_csv('Hackathon_Submission_5.csv', index = False)

# Checking the dataframe
Submission_5.head()
Out[305]:
ID Overall_Experience
0 99900001 1
1 99900002 1
2 99900003 1
3 99900004 0
4 99900005 1

Let's check the feature importances of the Random Forest

In [307]:
importances = rf_estimator.feature_importances_

columns = X_train.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

plt.figure(figsize = (10, 15))

sns.barplot(x = importance_df.Importance, y = importance_df.index)
Out[307]:
<Axes: xlabel='Importance'>

Observations:

The Random Forest identifies the following features as the top five most important in determining overall experience:

  • Onboard_Entertainment_Excellent
  • Seat_Comfort_Excellent
  • Onboard_Entertainment_Good
  • Travel_Class_Eco
  • Customer_Type_Loyal_Customer

The interpretation is that passengers predicted (and observed) to rate the overall experience as 1 are those who rate these top five parameters highly. It may be further instructive to reduce the training dataset to the most important features (e.g., the top 20-30) to make training more efficient; however, this may lead to information loss, so there could be undesirable costs. Alternatively, we could reduce the dimensionality of the data using principal component analysis (PCA) to assist with efficiency, though PCA would likely make the model results harder to interpret.
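A hedged sketch of the feature-reduction idea using scikit-learn's `SelectFromModel`, which keeps only the features whose importance clears a threshold (synthetic stand-in data; names illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the hackathon data (illustration only)
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           random_state=1)

rf = RandomForestClassifier(random_state=1).fit(X, y)

# Keep only the features whose importance is at least the mean importance
selector = SelectFromModel(rf, threshold='mean', prefit=True)
X_reduced = selector.transform(X)
```

The threshold (here the mean importance) is the knob that trades training efficiency against possible information loss.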

Since our project goal is to achieve the highest possible accuracy, we will tune the Random Forest hyperparameters and see whether the overfitting seen in the base model is corrected.

In [309]:
# Specifying an alternative Random Forest with some hyperparameters set via random search

alt_rf_estimator = RandomForestClassifier(n_estimators=220, max_depth=20, max_features=.75, random_state=100)

alt_rf_estimator.fit(X_train, Y_train)
Out[309]:
RandomForestClassifier(max_depth=20, max_features=0.75, n_estimators=220,
                       random_state=100)
In [311]:
# Checking performance on the training data
y_pred_train_rf_alt = alt_rf_estimator.predict(X_train)

metrics_score(Y_train, y_pred_train_rf_alt)
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     42786
           1       1.00      0.99      0.99     51593

    accuracy                           0.99     94379
   macro avg       0.99      0.99      0.99     94379
weighted avg       0.99      0.99      0.99     94379

In [313]:
# Checking performance on the testing data
y_pred_test_rf_alt = alt_rf_estimator.predict(X_test)

# metrics_score(y_test, y_pred_test_rf_alt)
In [315]:
y_pred_test_rf_alt
Out[315]:
array([1, 1, 1, ..., 1, 1, 0])
In [317]:
# Exporting the prediction on the test dataset as .csv

Prediction_6 = pd.DataFrame(y_pred_test_rf_alt)

Submission_6 = pd.concat([df_test['ID'], Prediction_6], axis=1)

Submission_6.columns=['ID', 'Overall_Experience']

# Saving the DataFrame as a CSV file 
# Submission_6.to_csv('Hackathon_Submission_6.csv', index = False)

# Checking the dataframe
Submission_6.head()
Out[317]:
ID Overall_Experience
0 99900001 1
1 99900002 1
2 99900003 1
3 99900004 0
4 99900005 1

Observations:

The model accuracy is 99% on the training data, which still suggests some overfitting, although this configuration performs better than the base model. We will re-check the feature importances of the adjusted Random Forest and compare the results.

In [319]:
importances = alt_rf_estimator.feature_importances_

columns = X_train.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

plt.figure(figsize = (10, 15))

sns.barplot(x = importance_df.Importance, y = importance_df.index)
Out[319]:
<Axes: xlabel='Importance'>

Observations:

The base Random Forest found the following features, in order of importance, as determining overall experience:

  • Onboard_Entertainment_Excellent
  • Seat_Comfort_Excellent
  • Onboard_Entertainment_Good
  • Travel_Class_Eco
  • Customer_Type_Loyal_Customer

The alternative Random Forest, with the hyperparameters set as above, finds the following order instead:

  • Onboard_Entertainment_Excellent
  • Onboard_Entertainment_Good
  • Seat_Comfort_Excellent
  • Seat_Comfort_Extremely_Poor
  • Seat_Comfort_Good

The shift in feature importances shows that, even without reducing the training dataset to the most important features or reducing the dimensionality of the data, the Random Forest performs better (i.e., overfits less) once its hyperparameters are adjusted.

So far we have used random search to narrow the range of each hyperparameter. To find the best specific combination of settings within those ranges, we will use GridSearchCV, which checks every combination we specify rather than sampling randomly from a distribution.
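For reference, the random-search step can also be done directly in scikit-learn with `RandomizedSearchCV`, which samples a fixed number of settings from the supplied distributions. A minimal sketch on synthetic stand-in data (names and ranges illustrative, not the ones used for the hackathon):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the hackathon data (illustration only)
X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# Unlike GridSearchCV, only n_iter settings are sampled from the distributions
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions={
        'n_estimators': randint(50, 201),
        'max_depth': randint(5, 21),
    },
    n_iter=5, cv=3, scoring='accuracy', random_state=1,
)
search.fit(X, y)
best_params = search.best_params_
```

Random search is the cheaper first pass; grid search then refines within the ranges it identifies.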

In [321]:
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(class_weight = 'balanced', random_state = 20)

# Grid of parameters to choose from
params_rf = {  
        "n_estimators": [200, 300],
        "min_samples_leaf": np.arange(1, 4, 1),
        "max_features": [0.8, 0.9, 'auto'],
}

# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, params_rf, scoring = scorer, cv = 5)

grid_obj = grid_obj.fit(X_train, Y_train)

# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
In [323]:
# Fitting the tuned model
rf_estimator_tuned.fit(X_train, Y_train)
Out[323]:
RandomForestClassifier(class_weight='balanced', max_features=0.8,
                       n_estimators=300, random_state=20)
In [331]:
# Checking performance on the training data
y_pred_train_rf_tuned = rf_estimator_tuned.predict(X_train)

metrics_score(Y_train, y_pred_train_rf_tuned)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     42786
           1       1.00      1.00      1.00     51593

    accuracy                           1.00     94379
   macro avg       1.00      1.00      1.00     94379
weighted avg       1.00      1.00      1.00     94379


Observations:

The tuned Random Forest returns 100% on every training metric; the model is clearly overfitting the training data.

In [333]:
# Checking performance on the testing data
y_pred_test_rf_tuned = rf_estimator_tuned.predict(X_test)

# metrics_score(y_test, y_pred_test_rf)
In [335]:
y_pred_test_rf_tuned
Out[335]:
array([1, 1, 1, ..., 1, 1, 0])
In [337]:
# Exporting the prediction on the test dataset as .csv

Prediction_6 = pd.DataFrame(y_pred_test_rf_tuned)

Submission_6 = pd.concat([df_test['ID'], Prediction_6], axis=1)

Submission_6.columns=['ID', 'Overall_Experience']

# Saving the DataFrame as a CSV file 
# Submission_6.to_csv('Hackathon_Submission_6.csv', index = False)

# Checking the dataframe
Submission_6.head()
Out[337]:
ID Overall_Experience
0 99900001 1
1 99900002 1
2 99900003 1
3 99900004 0
4 99900005 1

5) AdaBoost¶

In [339]:
from sklearn.ensemble import AdaBoostClassifier
In [358]:
# Initializing the AdaBoostClassifier
ada = AdaBoostClassifier(base_estimator=dtree_estimator, n_estimators=50, learning_rate=1.0, random_state=42)

# Fit the model to the training data
ada.fit(X_train, Y_train)

# Predict on the train set
y_pred_ada_train = ada.predict(X_train)

# Evaluate the model
metrics_score(Y_train, y_pred_ada_train)
              precision    recall  f1-score   support

           0       0.97      0.98      0.97     42786
           1       0.98      0.97      0.98     51593

    accuracy                           0.97     94379
   macro avg       0.97      0.98      0.97     94379
weighted avg       0.97      0.97      0.97     94379


Observations: The AdaBoost model returns an accuracy score of 0.97 on the training data. Since the training score is an optimistic estimate, this does not by itself guarantee that the model clears the 95% accuracy benchmark on unseen data. Let us run the prediction on the test data.
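Since the hackathon test labels are hidden, cross-validation on the training data is a safer way to estimate unseen-data accuracy than the training score. A minimal sketch on synthetic stand-in data (names illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the hackathon data (illustration only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Each fold is scored on data the model did not see during fitting,
# so the mean CV accuracy is a more honest generalization estimate.
scores = cross_val_score(AdaBoostClassifier(random_state=42), X, y,
                         cv=5, scoring='accuracy')
mean_cv_accuracy = scores.mean()
```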

In [360]:
# Predict on the test set
y_pred_ada_test = ada.predict(X_test)
In [362]:
y_pred_ada_test
Out[362]:
array([1, 1, 1, ..., 1, 1, 0], dtype=int64)
In [364]:
# Exporting the prediction on the test dataset as .csv

Prediction_7 = pd.DataFrame(y_pred_ada_test)

Submission_7 = pd.concat([df_test['ID'], Prediction_7], axis=1)

Submission_7.columns=['ID', 'Overall_Experience']

# Saving the DataFrame as a CSV file 
# Submission_7.to_csv('Hackathon_Submission_7.csv', index = False)

# Checking the dataframe
Submission_7.head()
Out[364]:
ID Overall_Experience
0 99900001 1
1 99900002 1
2 99900003 1
3 99900004 0
4 99900005 1

Let us use AdaBoost with the tuned Random Forest classifier as the base estimator.

In [356]:
# Initializing the AdaBoostClassifier with the tuned Random Forest as base estimator
ada_rf = AdaBoostClassifier(base_estimator=rf_estimator_tuned, n_estimators=75, learning_rate=1.0, random_state=64)

# Fit the model to the training data
ada_rf.fit(X_train, Y_train)

# Predict on the train set
y_pred_ada_rf_train = ada_rf.predict(X_train)

# Evaluate the model
metrics_score(Y_train, y_pred_ada_rf_train)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     42786
           1       1.00      1.00      1.00     51593

    accuracy                           1.00     94379
   macro avg       1.00      1.00      1.00     94379
weighted avg       1.00      1.00      1.00     94379


Let us tune the hyperparameters of the model and see if there is an improvement.
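A minimal sketch of how such tuning could look with GridSearchCV over the two AdaBoost hyperparameters that usually matter most (synthetic stand-in data; the grid values are illustrative, not the ones used for the hackathon):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the hackathon data (illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# Small grid over the number of boosting rounds and the learning rate
grid = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid={'n_estimators': [25, 50], 'learning_rate': [0.5, 1.0]},
    cv=3, scoring='accuracy',
)
grid.fit(X, y)
best = grid.best_params_
```

Lower learning rates generally need more estimators, so the two parameters should be tuned jointly rather than one at a time.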

In [236]:
# Predict on the test set
y_pred_ada_test = ada_rf.predict(X_test)
In [237]:
y_pred_ada_test
Out[237]:
array([1, 1, 1, ..., 1, 1, 0])
In [238]:
# Exporting the prediction on the test dataset as .csv

Prediction_8 = pd.DataFrame(y_pred_ada_test)

Submission_8 = pd.concat([df_test['ID'], Prediction_8], axis=1)

Submission_8.columns=['ID', 'Overall_Experience']

# Saving the DataFrame as a CSV file 
Submission_8.to_csv('Hackathon_Submission_8.csv', index = False)

# Checking the dataframe
Submission_8.head()
Out[238]:
ID Overall_Experience
0 99900001 1
1 99900002 1
2 99900003 1
3 99900004 0
4 99900005 1

6) XGBoost¶

In [239]:
import xgboost as xgb
In [240]:
# Initialize the XGBoost classifier
xgb_clf = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)

# Fit the model to the training data
xgb_clf.fit(X_train, Y_train)
Out[240]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=100, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=42, ...)
In [242]:
# Predict on the train set
y_pred_xg_train = xgb_clf.predict(X_train)

# Evaluate the model
metrics_score(Y_train, y_pred_xg_train)
              precision    recall  f1-score   support

           0       0.96      0.97      0.97     42786
           1       0.98      0.97      0.97     51593

    accuracy                           0.97     94379
   macro avg       0.97      0.97      0.97     94379
weighted avg       0.97      0.97      0.97     94379

In [243]:
# Predict on the test set
y_pred_xg_test = xgb_clf.predict(X_test)
In [244]:
y_pred_xg_test
Out[244]:
array([1, 1, 1, ..., 1, 1, 0])
In [246]:
# Exporting the prediction on the test dataset as .csv

Prediction_9 = pd.DataFrame(y_pred_xg_test)

Submission_9 = pd.concat([df_test['ID'], Prediction_9], axis=1)

Submission_9.columns=['ID', 'Overall_Experience']

# Saving the DataFrame as a CSV file 
Submission_9.to_csv('Hackathon_Submission_9.csv', index = False)

# Checking the dataframe
Submission_9.head()
Out[246]:
ID Overall_Experience
0 99900001 1
1 99900002 1
2 99900003 1
3 99900004 0
4 99900005 1

Conclusions & Recommendations:¶

The Random Forest with tuned hyperparameters, AdaBoost with a Decision Tree base estimator, and XGBoost achieved the highest accuracy in predicting the overall experience of passengers on the Shinkansen Bullet Train. Since the Random Forest shows possible overfitting compared to AdaBoost and XGBoost, the latter two would be more reliable in a production environment and should therefore be preferred. The tuned Random Forest model identifies the following features as the most important in determining overall customer experience: Onboard_Entertainment_Excellent, Onboard_Entertainment_Good, Seat_Comfort_Excellent, Seat_Comfort_Extremely_Poor and Seat_Comfort_Good. Efforts to improve the quality of the customers' experience along these dimensions should therefore have a positive impact on customer satisfaction.

In [ ]: